Abstract

Random projection algorithms are of interest for constrained optimization when the constraint set is not known in
advance or the projection operation on the whole constraint set is computationally prohibitive. This paper presents a
distributed random projection (DRP) algorithm for fully distributed constrained convex optimization problems that
can be used by multiple agents connected over a time-varying network, where each agent has its own objective
function and its own constraint set. Under reasonable assumptions, we prove that the iterates of all agents converge
to the same point in the optimal set almost surely. In addition, we consider a variant of the method that uses a
mini-batch of consecutive random projections and establish its almost sure convergence. Experiments on
distributed support vector machines demonstrate fast convergence of the algorithm: the number of iterations
required for convergence is much smaller than the number needed to scan over all training samples just once.

I. Introduction

A number of problems that arise in sensor, wireless ad hoc and peer-to-peer networks can be formulated as
convex constrained minimization problems [1]-[4]. The goal of the agents connected over such networks is to
cooperatively solve the following optimization problem:

$$\min\ f(x) = \sum_{i=1}^m f_i(x) \quad \text{s.t.} \quad x \in X \triangleq \bigcap_{i=1}^m X_i, \tag{1}$$

where each $f_i : \mathbb{R}^d \to \mathbb{R}$ is a convex function, representing the local objective function of agent $i$, and each
$X_i \subseteq \mathbb{R}^d$ is a closed convex set, representing the local constraint set of agent $i$. The complete problem information
is not available at a single location. This is because i) there is no central node that facilitates computation and
communication and ii) it is often not possible for one agent to keep all the objective and constraint components
due to memory, computational power, or privacy constraints. In addition, the network topology itself may change
with time due to agent mobility or link failures. Therefore, an optimization algorithm for solving such problems
must be distributed and robust: each agent exchanges its information only with its immediate neighbors, and
the algorithm adapts to changes in the network topology.

In this paper, we propose a distributed random projection (DRP) algorithm for problem (1), where the constraint
set is defined as the intersection of finitely many simple convex constraints. That is, $X_i = \bigcap_{j \in I_i} X_i^j$, where $I_i$ is a
finite set$^1$ (a formal definition of $I_i$ is in Section II). In our algorithm, each agent $i$ maintains its own iterate sequence
$\{x_i(k)\}$. At each iteration, each agent calculates a weighted average of the received iterates (from its neighbors) and
its own iterate, adjusts the iterate by using gradient information of its local objective function $f_i$ and projects onto
a constraint component that is selected randomly from its local constraint set $X_i$. The projections are performed
locally by each agent based on the random observations of the local constraint components. In particular, agent $i$
observes a constraint component $X_i^{\Omega_i(k)}$ at time $k$, where $\Omega_i(k) \in I_i$ is a random variable.

Our primary interest is in the case when the whole constraint set $X_i$ for an agent $i$ is not known in advance, but its
component is revealed through random realizations $X_i^{\Omega_i(k)}$. For example, in collaborative filtering for recommender
systems, user data is huge and distributed over multiple machines. Users frequently change and update their
preferences in real time, so the constraint set of this problem is usually not known in advance. Another case of interest
is when the whole constraint set $X_i$ is known in advance but it has a huge number of components. For example, in
text classification problems, model parameters are trained on hundreds of thousands or more text samples and
each sample constitutes a constraint component (usually a halfspace) [5]. In such a case, the projection operation on
the whole constraint set $X_i$ is computationally prohibitive if any of the traditional (sub)gradient projection methods
are used. In Section VII we will experiment with Support Vector Machines to classify three text data sets.

$^1$ The finiteness of $I_i$ is not really crucial. The developed results also apply to the case when the index sets $I_i$ are infinite.

In the optimization literature, algorithms of two categories have been proposed for problem (1): the Markov
incremental algorithm and the distributed subgradient algorithm. In the Markov incremental algorithm studied
in [6], [7], the agents maintain a single estimate sequence that is sequentially updated by one agent at a time.
When an agent receives the estimate, it updates the estimate using its local objective function and passes it to a
randomly selected neighbor. The update order is driven by a time-inhomogeneous Markov chain (as the network
topology is time varying). In the distributed subgradient algorithms, by contrast, each agent maintains its own estimate.
It communicates the estimate to its neighbors and updates it using the local objective and constraint information.
Algorithms of this type require a consensus over all agents for convergence. However, in some distributed problems
it is important that each agent maintains a good estimate at all times. For example, in distributed online learning,
each node is expected to perform in real time. Our DRP algorithm is in the distributed subgradient algorithm
category.

The related distributed optimization literature includes [8]-[15], which are concerned with convex but unconstrained
problems, and [16]-[18], where constrained problems are considered. The most relevant to the work in
this paper are [19]-[22] where, as in the DRP algorithm, the constraint set is also distributed across agents and
each agent handles its own constraint set only. In [19], the convergence analysis is done for a special case when
the network is completely connected. The work in [20], [21] extends the algorithm and its analysis to a more
general network including the presence of noisy links, while [22] extends it to a general Markovian network model.
Unlike [19] and [20], where each agent can perform projections on its entire constraint set, this paper addresses
the case when such projections are not possible or computationally prohibitive. Related to this work are also the
distributed algorithms for estimation and inference problems that have been proposed and studied by Sayed et
al. [23]-[26]. On a much broader scale, the work in this paper is related to the literature on the consensus problem,
where each agent starts from an initial value and ends by converging to a value common to all agents (see for
example [?], [27]-[30]).

The contribution of this paper is mainly in two directions. First, we propose a novel distributed optimization
algorithm that is based on local communications of agents' estimates in a network and a gradient descent with
random projections. Second, we study the convergence of the algorithm and its variant using a mini-batch of random
projections. To the best of our knowledge, there is no previous work on distributed optimization algorithms that
utilize random projections. Gradient and subgradient random projection algorithms for centralized (not distributed)
convex problems have been proposed in [31]. Also related is the (centralized) random projection method
for a special class of convex feasibility problems, which has been proposed and studied by Polyak [32].

The rest of the paper is organized as follows. In Section II we introduce the problem of interest, formally
describe our algorithm and state assumptions on the problem and network. In Section III we state some results
from the literature that we use in the convergence analysis. In Section IV we derive two important results that will
play crucial roles in the convergence analysis. In Section V we study the almost sure convergence property of our
DRP algorithm. We provide an extension of the algorithm to a variant that uses a mini-batch of random projections
and we state a convergence result for this extension in Section VI. As an application of our DRP algorithm and
its mini-batch variant, in Section VII we introduce a linear SVM formulation, discuss how to apply the algorithm,
and present some experimental results on binary text classification tasks. Section VIII contains concluding remarks
and future directions.

Notation. A vector is viewed as a column. We write $x^T$ to denote the transpose of a vector $x$. The scalar product
of two vectors $x$ and $y$ is $\langle x, y \rangle$. We use a subscript $i$ to denote an agent $i$. An index $k$ in parentheses
represents a time. For example, $x_i(k)$ is the iterate of an agent $i$ at time $k$. We use $\|x\|$ to denote the standard
Euclidean norm. We write $\mathrm{dist}(x, X)$ for the distance of a vector $x$ from a closed convex set $X$, i.e.,
$\mathrm{dist}(x, X) = \min_{v \in X} \|v - x\|$. We use $\Pi_X[x]$ for the projection of a vector $x$ on the set $X$, i.e., $\Pi_X[x] = \arg\min_{v \in X} \|v - x\|^2$.
We use $\Pr\{Z\}$ and $E[Z]$ to denote the probability and the expectation of a random variable $Z$. We abbreviate
almost surely and independent and identically distributed as a.s. and iid, respectively.
II. Problem Set-up, Algorithm and Assumptions

A. Optimization over a Network

We consider a constrained convex optimization problem (1) that is distributed over a network of $m$ agents,
indexed by $V = \{1, \ldots, m\}$. The function $f_i$ and the constraint set $X_i$ in (1) are private information of agent $i$
(not shared with any other agent). Collectively, the agents are responsible for solving problem (1).

We are interested in the case when each constraint set $X_i$ is the intersection of finitely many closed convex sets.
Without loss of generality, let $X$ be the intersection of $n$ closed convex sets. Let $I = \{1, \ldots, n\}$ be the index set,
and let $I_i$, $i \in V$, be a partition of $I$ (i.e., $I = \bigcup_{i=1}^m I_i$ and $I_i \cap I_j = \emptyset$ for $i \ne j$) such that each $I_i$ is associated
with the local constraint set $X_i$ of agent $i$, i.e.,

$$X_i = \bigcap_{j \in I_i} X_i^j \quad \text{for a finite index set } I_i,$$

where the superscript is used to identify a component set. Each component set $X_i^j$ is assumed to be a "simple set"
for the projection operation. Examples of such simple sets include a halfspace $X_i^j = \{x \in \mathbb{R}^d \mid \langle a, x \rangle \le b\}$, a
box $X_i^j = \{x \in \mathbb{R}^d \mid \alpha \le x \le \beta\}$ (the inequality is component-wise) and a ball $X_i^j = \{x \in \mathbb{R}^d \mid \|x - v\| \le r\}$,
where $a, \alpha, \beta, v \in \mathbb{R}^d$ and $b, r \in \mathbb{R}$. In such cases, the projection on the set $X_i$ can be complex, especially when
the number of components is large, while the projection on each component $X_i^j$ has a closed form expression.
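
As an illustration of the closed form expressions mentioned above, the following Python sketch implements the Euclidean projections onto a halfspace, a box and a ball. It is only a minimal sketch: the function names are ours (they do not come from the paper), vectors are plain lists of floats, and the halfspace normal $a$ is assumed to be nonzero.

```python
import math


def project_halfspace(x, a, b):
    """Project x onto the halfspace {y : <a, y> <= b} (a is assumed nonzero)."""
    violation = sum(ai * xi for ai, xi in zip(a, x)) - b
    if violation <= 0.0:
        return list(x)  # x is already feasible
    scale = violation / sum(ai * ai for ai in a)
    return [xi - scale * ai for xi, ai in zip(x, a)]


def project_box(x, lower, upper):
    """Project x onto the box {y : lower <= y <= upper} (component-wise)."""
    return [min(max(xi, lo), hi) for xi, lo, hi in zip(x, lower, upper)]


def project_ball(x, center, radius):
    """Project x onto the ball {y : ||y - center|| <= radius}."""
    diff = [xi - ci for xi, ci in zip(x, center)]
    norm = math.sqrt(sum(d * d for d in diff))
    if norm <= radius:
        return list(x)  # x is already inside the ball
    return [ci + radius * d / norm for ci, d in zip(center, diff)]
```

Each projection costs $O(d)$ operations, which is the reason a single random component projection is cheap compared with a projection on the whole intersection $X_i$.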

We use the following assumption for the functions $f_i$ and the sets $X_i^j$.

Assumption 1: Let the following conditions hold:

(a) The sets $X_i^j$, $j \in I_i$, are closed and convex for every $i \in V$.

(b) Each function $f_i : \mathbb{R}^d \to \mathbb{R}$ is convex.

(c) The functions $f_i$, $i \in V$, are differentiable and have Lipschitz gradients with a constant $L$ over $\mathbb{R}^d$,

$$\|\nabla f_i(x) - \nabla f_i(y)\| \le L \|x - y\| \quad \text{for all } x, y \in \mathbb{R}^d.$$

(d) The gradients $\nabla f_i(x)$, $i \in V$, are bounded over the set $X$, i.e., there exists a constant $G_f$ such that

$$\|\nabla f_i(x)\| \le G_f \quad \text{for all } x \in X \text{ and all } i \in V.$$

When each $f_i$ has Lipschitz gradients with a constant $L_i$, Assumption 1(c) is satisfied with $L = \max_{i \in V} L_i$.
Further note that Assumption 1(d) is satisfied, for example, when $X$ is compact.

As mentioned earlier, the agents are collectively responsible for solving problem (1), without sharing their
private knowledge of individual objective functions $f_i$ and the constraint sets $X_i$. To accommodate such a task,
the agents are assumed to form a network, wherein each agent communicates its iterates to its local neighbors.
More specifically, at each time $k$, the network topology is represented by a directed graph $G(k) = (V, E(k))$, where
$E(k) \subseteq V \times V$. A link $(i, j) \in E(k)$ indicates that agent $i$ has received information from agent $j$ at time $k$. We let
$N_i(k)$ denote the set of agents who send information to agent $i$, i.e., $N_i(k) = \{j \in V \mid (i, j) \in E(k)\}$. We assume
that $i \in N_i(k)$ for all $i \in V$ and for all $k$.

B. Distributed Random Projection Algorithm (DRP) 

To solve problem (1) with distributed information access, we propose an iterative gradient method with random
projections. Let $x_i(k) \in \mathbb{R}^d$ denote the estimate of agent $i$ at time $k$. At time $k$, each agent sends its estimate to its
neighbors (as represented by the graph $(V, E(k))$). Upon receiving the estimates $x_j(k)$ from its neighbors $j \in N_i(k)$,
each agent $i$ updates according to the following two steps:

$$v_i(k) = \sum_{j \in N_i(k)} w_{ij}(k)\, x_j(k), \tag{2a}$$

$$x_i(k+1) = \Pi_{X_i^{\Omega_i(k)}}\big[v_i(k) - \alpha_k \nabla f_i(v_i(k))\big], \tag{2b}$$

where $\alpha_k > 0$ is a stepsize at time $k$ and $x_i(0) \in \mathbb{R}^d$ is an initial estimate of agent $i$ (which can be random).

In the above, relation (2a) captures an information mixing step, while (2b) captures a local minimization and
feasibility update step using a random projection. In (2a), the iterate $v_i(k)$ is a weighted average of agent $i$'s
estimate and the estimates received from its neighbors $j \in N_i(k)$. Specifically, $w_{ij}(k) \ge 0$ is a weight that agent $i$
places on the estimate $x_j(k)$ received from a neighbor $j \in N_i(k)$ at time $k$, where the total weight sum is 1, i.e.,
$\sum_{j \in N_i(k)} w_{ij}(k) = 1$ for each agent $i$. The step (2a) can be equivalently represented as

$$v_i(k) = \sum_{j=1}^m [W(k)]_{ij}\, x_j(k) \tag{3}$$

by letting $w_{ij}(k) = 0$ whenever $j \notin N_i(k)$, and using $[W]_{ij}$ to denote the $(i, j)$th entry of a matrix $W$.

In (2b), agent $i$ adjusts the average $v_i(k)$ along the negative gradient direction of its local objective $f_i$. At time
$k$, agent $i$ also observes a random realization $X_i^{\Omega_i(k)}$ of its local constraint component sets. To reduce the feasibility
violation, it projects its current estimate on this set. The random variable $\Omega_i(k)$ takes values in the index set $I_i$ at
all times $k$. In this way, instead of projecting onto the whole local constraint set $X_i$, agent $i$ projects only on a
component set $X_i^{\Omega_i(k)}$ which is randomly selected at time $k$. Note that the updated estimate $x_i(k+1)$ may not lie
in $X_i$ since $X_i \subseteq X_i^{\Omega_i(k)}$.
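
The two steps (2a)-(2b) can be sketched in a few lines of Python. This is only an illustrative sketch under our own conventions, not the implementation used in the experiments: agents are indexed from 0, the argument names (`grads`, `projections`, `picks`) are hypothetical, and the realized indices $\Omega_i(k)$ are passed in explicitly so that the step itself is deterministic.

```python
def drp_step(x, W, grads, projections, picks, alpha):
    """One synchronous DRP iteration for all agents.

    x           : list of m current estimates x_i(k), each a list of d floats
    W           : m x m weight matrix, with W[i][j] = 0 when j is not a neighbor of i
    grads       : grads[i](v) returns the gradient of f_i at v
    projections : projections[i][j](y) returns the projection of y onto the set X_i^j
    picks       : picks[i] is the realized index Omega_i(k) for agent i
    alpha       : stepsize alpha_k
    """
    m, d = len(x), len(x[0])
    new_x = []
    for i in range(m):
        # (2a): weighted average of the estimates received from the neighbors
        v = [sum(W[i][j] * x[j][c] for j in range(m)) for c in range(d)]
        # (2b): gradient step on the local objective ...
        g = grads[i](v)
        step = [v[c] - alpha * g[c] for c in range(d)]
        # ... followed by a projection on the randomly observed component
        new_x.append(projections[i][picks[i]](step))
    return new_x
```

In a full run, `picks[i]` would be drawn at every iteration from the distribution of $\Omega_i(k)$ over $I_i$, and `W` would change with the network at time $k$.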

Through the updates (2a) and (2b), agents combine their information and consider their own optimization problem
of minimizing $f_i$ over the set $X_i$. There is neither a central node governing the whole process nor additional
constraints enforcing consistency. Nevertheless, with this simple update rule, all agents eventually arrive at a
common optimal solution (all $x_i(k)$ converge to some $x^* \in X^*$, as shown in
Section V).

Note that algorithm (2a)-(2b) is similar to the distributed projected subgradient algorithm in [19] except for the
randomization over the components of the set $X_i$ in (2b). At each iteration of the algorithm in [19], a projection
is performed on the entire constraint set $X_i$, which can be prohibitively expensive when $X_i$ is itself an intersection
of many sets. In addition, unlike the method in [19], DRP can also handle the cases when the projection on the
entire set $X_i$ is not possible since the set $X_i$ may not be known in advance.

The challenges in convergence analysis of the DRP algorithm are posed mainly by its distributed nature, through
the effects of the time-varying network, and by the projection errors associated with using projections on components
$X_i^j$, $j \in I_i$, of the set $X_i = \bigcap_{j \in I_i} X_i^j$ instead of the projection on the set $X_i$. The fact that the DRP relies on a
random component $X_i^{\Omega_i(k)}$ poses particular difficulties, as one needs to characterize the impact of the random projection
errors, which is closely related to errors in "set-approximations". To handle these difficulties, we make several mild
assumptions. We make an assumption on the random set processes $\{\Omega_i(k)\}$, $i \in V$, that allows us to characterize
the projection errors. For the network we assume that it is sufficiently connected in order to properly propagate
the information among the agents. Finally, we assume that the agent weights are also properly chosen to ensure
that each agent is equally influencing every other agent. These network assumptions have been typically used in
distributed optimization algorithms over a time-varying network (see e.g. [13], [14], [16], [33]-[35]). In the next
subsections, we state our assumptions on the random set processes $\{\Omega_i(k)\}$, $i \in V$, the network and the weight
matrices $W(k)$.



C. Assumptions on Random Set Process 

For the random sequences $\{\Omega_i(k)\}$, $i \in V$, we assume the following.

Assumption 2: The sequences $\{\Omega_i(k)\}$, $i \in V$, are iid and independent of the initial random points $x_i(0)$, $i \in V$.
We have $\pi_i^j = \Pr\{\Omega_i(k) = j\} > 0$ for all $j \in I_i$ and $i \in V$.

The variable $\Omega_i(k)$ can be viewed as a random sample at time $k$ of a random variable $\Omega_i$ that takes values $j \in I_i$
with probability $\pi_i^j$. In some situations the probability distributions $\pi_i$ may be dictated by nature and agent $i$ cannot
control them. In situations where the agents have all sets $X_i^j$, $j \in I_i$, available, each agent $i$ can choose a uniform
distribution $\pi_i$ over the set $I_i$.

The next assumption is crucial in our analysis.

Assumption 3: For all $i \in V$, there exists a constant $c > 0$ such that for all $x \in \mathbb{R}^d$,

$$\mathrm{dist}^2(x, X) \le c\, E\big[\mathrm{dist}^2\big(x, X_i^{\Omega_i(k)}\big)\big]. \tag{4}$$

Assumption 3 is satisfied, for example, when each set $X_i^j$ is given by either a linear inequality or a linear equality,
or when the intersection set $X$ has a nonempty interior. In the first case, one can verify that the assumption holds
by using the results of Burke and Ferris on a set of weak sharp minima [36]. In the second case, one can use the
ideas of the convergence rate analysis for the alternating projection algorithm of Gubin, Polyak and Raik in [37].
In either case, the constant $c$ depends on the probability distributions $\pi_i$ and some geometric properties of the sets.
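
To make relation (4) concrete, consider a toy single-agent case ($m = 1$) of our own choosing, in which the components are the coordinate halfspaces $X^j = \{x \mid x_j \le 0\}$ and $X$ is the nonpositive orthant. Then $\mathrm{dist}^2(x, X) = \sum_j \max(x_j, 0)^2$ and $E[\mathrm{dist}^2(x, X^{\Omega})] = \sum_j \pi^j \max(x_j, 0)^2$, so (4) holds with $c = 1 / \min_j \pi^j$ for this particular example. The sketch below (with hypothetical helper names) checks this numerically on random points.

```python
import random


def dist2_orthant(x):
    """Squared distance from x to the nonpositive orthant {y : y <= 0}."""
    return sum(max(xi, 0.0) ** 2 for xi in x)


def expected_dist2_component(x, probs):
    """E[dist^2(x, X^Omega)] when X^j = {y : y_j <= 0} is picked with probability probs[j]."""
    return sum(p * max(xi, 0.0) ** 2 for p, xi in zip(probs, x))


def worst_ratio(probs, trials=1000, seed=0):
    """Largest observed ratio dist^2(x, X) / E[dist^2(x, X^Omega)] over random points."""
    rng = random.Random(seed)
    worst = 0.0
    for _ in range(trials):
        x = [rng.uniform(-1.0, 1.0) for _ in probs]
        denom = expected_dist2_component(x, probs)
        if denom > 0.0:  # skip feasible points, where both sides of (4) are zero
            worst = max(worst, dist2_orthant(x) / denom)
    return worst
```

With uniform probabilities over two components the observed ratio equals 2, in line with $c = 1/\min_j \pi^j$; skewing the distribution toward one component increases the constant.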



D. Assumptions on the Network and Weight Matrices 

We rely on the graphs $(V, E(k))$, $k \ge 0$, to represent the time-varying network. We make two assumptions.

Assumption 4: [Network Connectivity] There exists a scalar $Q$ such that the graph $\big(V, \bigcup_{\ell = 0, \ldots, Q-1} E(k + \ell)\big)$
is strongly connected for all $k \ge 0$.

Assumption 4 ensures that the agents communicate sufficiently often so that all functions and all constraints ($f_i$'s
and $X_i^j$'s) influence the iterates of all agents.
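
For a finite, recorded sequence of edge sets, the window condition of Assumption 4 can be checked directly. The following sketch is only illustrative: the function names are ours, agents are indexed from 0, an edge $(i, j)$ means that agent $i$ receives from agent $j$, and only the windows that fit inside the recorded horizon are tested (the assumption itself concerns all $k \ge 0$).

```python
def reachable(adj, start):
    """Set of nodes reachable from start by a depth-first search over adj."""
    seen = {start}
    stack = [start]
    while stack:
        u = stack.pop()
        for w in adj[u]:
            if w not in seen:
                seen.add(w)
                stack.append(w)
    return seen


def is_strongly_connected(m, edges):
    """edges is a set of pairs (i, j), meaning agent i receives from agent j."""
    fwd = {i: set() for i in range(m)}
    rev = {i: set() for i in range(m)}
    for i, j in edges:
        fwd[j].add(i)  # information flows from j to i
        rev[i].add(j)
    # strongly connected iff node 0 reaches every node in the graph and in its reverse
    return len(reachable(fwd, 0)) == m and len(reachable(rev, 0)) == m


def satisfies_window_connectivity(m, edge_sets, Q):
    """Check that E(k) U ... U E(k+Q-1) is strongly connected for every full window."""
    return all(
        is_strongly_connected(m, set().union(*edge_sets[k:k + Q]))
        for k in range(len(edge_sets) - Q + 1)
    )
```

For example, three agents that activate the links of a directed cycle one at a time satisfy the condition with $Q = 3$ but not with $Q = 2$, even though no single graph $(V, E(k))$ is connected.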

Next, we make the following assumption on the edge weights (defined in Section II-B).

Assumption 5: [Doubly Stochasticity] For all $k \ge 0$,

(a) $[W(k)]_{ij} \ge 0$ and $[W(k)]_{ij} = 0$ when $j \notin N_i(k)$,

(b) $\sum_{j=1}^m [W(k)]_{ij} = 1$ for all $i \in V$,

(c) There exists a scalar $\eta \in (0, 1)$ such that $[W(k)]_{ij} \ge \eta$ when $j \in N_i(k)$,

(d) $\sum_{i=1}^m [W(k)]_{ij} = 1$ for all $j \in V$.

Assumption 5(a) states that the weights respect the network topology at any time $k$. Assumption 5(b) means that
each agent calculates a weighted average of the estimates obtained from its neighbors. Assumption 5(c) ensures
that each agent gives sufficient weight to the information received. Assumption 5(d) together with Assumption 4
ensures that each agent is equally influential in the long run so that the agents arrive at a consensus on an optimal
solution.
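
One common way to obtain weights satisfying Assumption 5 on an undirected graph is the Metropolis rule, $w_{ij} = 1 / (1 + \max\{d_i, d_j\})$ for neighbors $i \ne j$, with the remaining mass placed on the self-weight $w_{ii}$. This rule is our illustrative choice and is not prescribed by the paper; the sketch below (with hypothetical function names, agents indexed from 0, and no self-loops in the edge list) builds such a matrix and verifies double stochasticity.

```python
def metropolis_weights(m, undirected_edges):
    """Metropolis weight matrix for an undirected graph given as pairs (i, j), i != j."""
    neighbors = {i: set() for i in range(m)}
    for i, j in undirected_edges:
        neighbors[i].add(j)
        neighbors[j].add(i)
    W = [[0.0] * m for _ in range(m)]
    for i in range(m):
        for j in neighbors[i]:
            # symmetric weight determined by the larger of the two degrees
            W[i][j] = 1.0 / (1 + max(len(neighbors[i]), len(neighbors[j])))
        # the self-weight absorbs the remaining mass, so that each row sums to one
        W[i][i] = 1.0 - sum(W[i][j] for j in neighbors[i])
    return W


def is_doubly_stochastic(W, tol=1e-12):
    """Check nonnegativity and unit row and column sums up to a tolerance."""
    m = len(W)
    rows = all(abs(sum(W[i]) - 1.0) <= tol for i in range(m))
    cols = all(abs(sum(W[i][j] for i in range(m)) - 1.0) <= tol for j in range(m))
    nonneg = all(w >= 0.0 for row in W for w in row)
    return rows and cols and nonneg
```

Because the resulting matrix is symmetric with unit row sums, its column sums are also one, and every self-weight is at least $1/(1 + d_i) > 0$, which is consistent with the requirement $i \in N_i(k)$.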



III. Preliminaries 

In this section, we state some definitions and results from the literature, which will be used in later sections.

Convexity of Euclidean norm and its square. Both the Euclidean norm and its square are convex functions, i.e., for
any vectors $v_1, \ldots, v_m \in \mathbb{R}^d$ and nonnegative scalars $\beta_1, \ldots, \beta_m$ such that $\sum_{i=1}^m \beta_i = 1$, we have

$$\Big\| \sum_{i=1}^m \beta_i v_i \Big\| \le \sum_{i=1}^m \beta_i \|v_i\|, \qquad \Big\| \sum_{i=1}^m \beta_i v_i \Big\|^2 \le \sum_{i=1}^m \beta_i \|v_i\|^2. \tag{5}$$

Non-expansive projection property. We state a projection theorem (see [38] for its proof).

Lemma 1: Let $X \subseteq \mathbb{R}^d$ be a nonempty closed convex set. The function $\Pi_X : \mathbb{R}^d \to X$ is continuous and
nonexpansive, i.e.,

(a) $\|\Pi_X[x] - \Pi_X[y]\| \le \|x - y\|$ for all $x, y \in \mathbb{R}^d$.

(b) $\|\Pi_X[x] - y\|^2 \le \|x - y\|^2 - \|\Pi_X[x] - x\|^2$ for all $x \in \mathbb{R}^d$ and for all $y \in X$.
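
Both parts of Lemma 1 are easy to check numerically on a concrete set. The sketch below does so for the unit ball centered at the origin, a set chosen by us purely for illustration; the helper names and the tolerance are likewise our own assumptions.

```python
import math
import random


def norm(v):
    return math.sqrt(sum(t * t for t in v))


def sub(u, v):
    return [a - b for a, b in zip(u, v)]


def project_unit_ball(x):
    """Projection onto the closed unit ball centered at the origin."""
    n = norm(x)
    return list(x) if n <= 1.0 else [t / n for t in x]


def check_lemma1(trials=500, seed=0, tol=1e-9):
    """Test parts (a) and (b) of the projection lemma on random points in R^3."""
    rng = random.Random(seed)
    for _ in range(trials):
        x = [rng.uniform(-3.0, 3.0) for _ in range(3)]
        w = [rng.uniform(-3.0, 3.0) for _ in range(3)]
        px, pw = project_unit_ball(x), project_unit_ball(w)
        # (a): the projection is nonexpansive
        if norm(sub(px, pw)) > norm(sub(x, w)) + tol:
            return False
        # (b): applied with the feasible point y = pw, which lies in the ball
        lhs = norm(sub(px, pw)) ** 2
        rhs = norm(sub(x, pw)) ** 2 - norm(sub(px, x)) ** 2
        if lhs > rhs + tol:
            return False
    return True
```

Part (b) is the stronger statement used later in the analysis: projecting $x$ brings it closer to every feasible point $y$ by at least the squared length of the projection move.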

Matrix convergence. Recall that we defined $W(k)$ to be the matrix with $(i, j)$th entry equal to $w_{ij}(k)$. From
Assumption 5, the matrix $W(k)$ is doubly stochastic. Define for all $k, s$ with $k \ge s \ge 0$,

$$\Phi(k, s) = W(k) W(k-1) \cdots W(s+1) W(s), \tag{6}$$

with $\Phi(k, k) = W(k)$ for all $k \ge 0$. We state the convergence property of the matrix $\Phi(k, s)$ (see [14] for its
proof). Let $[\Phi(k, s)]_{ij}$ denote the $(i, j)$th entry of the matrix $\Phi(k, s)$, and let $e \in \mathbb{R}^m$ be the column vector whose
entries are all equal to 1.

Lemma 2: Let Assumptions 4 and 5 hold. Then,

(a) $\lim_{k \to \infty} \Phi(k, s) = \frac{1}{m} e e^T$ for all $s \ge 0$.

(b) $\big| [\Phi(k, s)]_{ij} - \frac{1}{m} \big| \le \theta \beta^{k-s}$ for all $k \ge s \ge 0$, where $\theta = \big(1 - \frac{\eta}{4m^2}\big)^{-2}$ and $\beta = \big(1 - \frac{\eta}{4m^2}\big)^{1/Q}$.
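
The geometric convergence in Lemma 2 can be observed numerically. The sketch below forms the product $\Phi(k, s)$ in (6) for a list of weight matrices and measures its largest entry-wise deviation from $\frac{1}{m} e e^T$. The fixed three-agent matrix used in the usage note is an example of our own; it does not come from the paper.

```python
def matmul(A, B):
    """Product of two square matrices stored as lists of rows."""
    n = len(A)
    return [[sum(A[i][l] * B[l][j] for l in range(n)) for j in range(n)] for i in range(n)]


def transition_product(weights, k, s):
    """Phi(k, s) = W(k) W(k-1) ... W(s+1) W(s), with Phi(k, k) = W(k)."""
    P = weights[k]
    for t in range(k - 1, s - 1, -1):
        P = matmul(P, weights[t])
    return P


def max_deviation_from_average(P):
    """Largest entry-wise distance between P and the averaging matrix (1/m) e e^T."""
    m = len(P)
    return max(abs(P[i][j] - 1.0 / m) for i in range(m) for j in range(m))
```

For instance, with `W = [[2/3, 1/3, 0.0], [1/3, 1/3, 1/3], [0.0, 1/3, 2/3]]` repeated at every time step, the deviation of `transition_product([W] * 60, k, 0)` shrinks geometrically in `k`, which is the behavior that part (b) quantifies.
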

Supermartingale convergence result. In our analysis of the DRP algorithm, we also make use of the following
supermartingale convergence result due to Robbins and Siegmund (see [39, Lemma 10-11, p. 49-50]).

Theorem 1: Let $\{v_k\}$, $\{u_k\}$, $\{a_k\}$ and $\{b_k\}$ be sequences of non-negative random variables such that

$$E[v_{k+1} \mid \mathcal{F}_k] \le (1 + a_k) v_k - u_k + b_k \quad \text{for all } k \ge 0 \text{ a.s.},$$

where $\mathcal{F}_k$ denotes the collection $v_0, \ldots, v_k$, $u_0, \ldots, u_k$, $a_0, \ldots, a_k$, and $b_0, \ldots, b_k$. Also, let $\sum_{k=0}^\infty a_k < \infty$ and
$\sum_{k=0}^\infty b_k < \infty$ a.s. Then, we have $\lim_{k \to \infty} v_k = v$ for a random variable $v \ge 0$ a.s., and $\sum_{k=0}^\infty u_k < \infty$ a.s.

The above theorem is the key in our convergence analysis. Specifically, once we show that Theorem 1 applies
to $v_{k+1} = \sum_{i=1}^m \|x_i(k+1) - x^*\|^2$ for an optimal solution $x^*$, the rest of the proof just builds on the implications
of the theorem.

Scalar Sequences. We also use the convergence result for scalar sequences (see Lemma 3.1 in [16] for its proof).
For a scalar $\beta$ and a scalar sequence $\{\gamma(k)\}$, we consider the convolution sequence $\sum_{\ell=0}^k \beta^{k-\ell} \gamma(\ell)$.
Lemma 3: If $\lim_{k \to \infty} \gamma(k) = \gamma$ and $0 < \beta < 1$, then $\lim_{k \to \infty} \sum_{\ell=0}^k \beta^{k-\ell} \gamma(\ell) = \frac{\gamma}{1 - \beta}$.
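
The limit in Lemma 3 can be checked numerically by computing the convolution sums through the recursion $s_k = \beta s_{k-1} + \gamma(k)$, which follows directly from the definition. The sketch below is only an illustration; the function name and the test sequence mentioned afterwards are our own choices.

```python
def convolution_sums(beta, gamma_seq):
    """Return [s_0, s_1, ...] with s_k = sum_{l=0}^{k} beta^(k-l) * gamma(l).

    Uses the equivalent recursion s_k = beta * s_{k-1} + gamma(k), with s_{-1} = 0.
    """
    s = 0.0
    sums = []
    for g in gamma_seq:
        s = beta * s + g
        sums.append(s)
    return sums
```

For example, with $\beta = 0.5$ and $\gamma(k) = 3 + 1/(k+1)$, which converges to $\gamma = 3$, the sums approach $\gamma / (1 - \beta) = 6$.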

IV. Basic Relations 

Our convergence analysis is based on a critical relation that captures the decrease in values $\sum_{i=1}^m \|x_i(k+1) - x^*\|^2$
as the algorithm progresses. Such a relation is provided in Lemma 4, which is taken from [31] where it was developed
for a centralized algorithm. This basic relation is further refined to take into account the distributed nature of the
algorithm. Specifically, in Lemma 5 we show that the weighted averages $v_i(k)$ of the iterates approach the constraint
set $X$ asymptotically. Then, in Lemma 7 we prove that the agents' disagreement on $v_i(k)$ is diminishing with the
number $k$ of iterations. The proof of Lemma 7 relies on an auxiliary result taken from [16], which is provided in
Lemma 6.

In the analysis, we will rely on the expectation taken with respect to the past history of the algorithm, which
we define as follows. Let $\mathcal{F}_k$ be the $\sigma$-algebra generated by the entire history of the algorithm up to time $k - 1$
inclusively (realizations of all the random variables but not the realizations of the indices at time $k$), i.e., for all
$k \ge 1$,

$$\mathcal{F}_k = \{x_i(0),\ i \in V\} \cup \{\Omega_i(\ell);\ 0 \le \ell \le k - 1,\ i \in V\},$$

where $\mathcal{F}_0 = \{x_i(0),\ i \in V\}$. Therefore, given $\mathcal{F}_k$, the collection $x_i(0), \ldots, x_i(k)$ and $v_i(0), \ldots, v_i(k)$ generated by
the algorithm (2a)-(2b) is fully determined.



A. Basic Iterate Relation 

The following lemma is from the paper [31, Lemma 1], which provides a relation among the iterate obtained after
one step of the algorithm (2a), a point in the feasible set $X$ and an arbitrary point in $\mathbb{R}^d$.

Lemma 4: Let $\mathcal{Y} \subseteq \mathbb{R}^d$ be a closed convex set. Let the function $\phi : \mathbb{R}^d \to \mathbb{R}$ be convex and
differentiable over $\mathbb{R}^d$ with Lipschitz continuous gradients with a constant $L$. Let $y$ be given by

$$y = \Pi_{\mathcal{Y}}[x - \alpha \nabla \phi(x)] \quad \text{for some } x \in \mathbb{R}^d \text{ and } \alpha > 0.$$

Then, we have for any $\check{x} \in \mathcal{Y}$ and $z \in \mathbb{R}^d$,

$$\|y - \check{x}\|^2 \le (1 + A_\tau \alpha^2) \|x - \check{x}\|^2 - 2\alpha \big(\phi(z) - \phi(\check{x})\big) - \frac{3}{4} \|y - x\|^2 + \Big(\frac{3}{8\tau} + 2\alpha L\Big) \|x - z\|^2 + B_\tau \alpha^2 \|\nabla \phi(\check{x})\|^2, \tag{7}$$

where $A_\tau = 8L^2 + 16\tau L^2$, $B_\tau = 8\tau + 8$ and $\tau > 0$ is arbitrary.

Lemma 4 provides a measure of progress toward an optimal point of the function $\phi$ when moving from a point
$x$ in the direction opposite to the gradient $\nabla \phi(x)$. Specifically, if $x^*$ is a minimizer of $\phi(x)$ over $\mathcal{Y}$, the lemma
(with $\check{x} = x^*$) will provide us with a relation between the distances $\|y - x^*\|$ and $\|x - x^*\|$, where the point $y$
results from a projected-gradient step away from the point $x$. The lemma thus helps us
measure the progress of a gradient-based algorithm for minimizing $\phi$. Lemma 4, with a specific identification of
the terms, will be a starting point for our convergence proof.



B. Projection Estimate 

In the next lemma, we show that the sequences $\{v_i(k)\}$, $i \in V$, approach the constraint set $X$. The result does
not say that these sequences necessarily have accumulation points in $X$, but rather that the distance between $v_i(k)$
and the set $X$ tends to 0, as $k \to \infty$, for all $i$. Furthermore, these distances converge to 0 rather fast, as the sum
of all squared distances over time is finite, which is a critical relation in our analysis.

Lemma 5: Let Assumption 1 hold. Let each $W(k)$ be doubly stochastic, and let $\sum_{k=0}^\infty \alpha_k^2 < \infty$. Then,

$$\sum_{k=0}^\infty \mathrm{dist}^2(v_i(k), X) < \infty \quad \text{for all } i \in V \text{ a.s.}$$

Proof: In Lemma 4, let $y = x_i(k+1)$, $x = v_i(k)$, $\mathcal{Y} = X_i^{\Omega_i(k)}$, $\alpha = \alpha_k$, $\phi = f_i$ and $\tau = c$, where $c$ is the
constant from Assumption 3. Then, for any $\check{x} \in X$ (also in $X_i^{\Omega_i(k)}$, since $X \subseteq X_i^{\Omega_i(k)}$) and any $z \in \mathbb{R}^d$, we obtain

$$\begin{aligned}
\|x_i(k+1) - \check{x}\|^2 \le{}& (1 + A\alpha_k^2) \|v_i(k) - \check{x}\|^2 - 2\alpha_k \big(f_i(z) - f_i(\check{x})\big) - \frac{3}{4} \|x_i(k+1) - v_i(k)\|^2 \\
& + \Big(\frac{3}{8c} + 2\alpha_k L\Big) \|v_i(k) - z\|^2 + B\alpha_k^2 G_f^2,
\end{aligned}$$

where $A = 8L^2 + 16cL^2$ and $B = 8c + 8$. Here, we have also used Assumption 1(d), according to which the
gradients $\nabla f_i(x)$ are bounded on the set $X$, i.e., $\|\nabla f_i(\Pi_X[v_i(k)])\| \le G_f$ for all $k$ and $i$.
Letting $\check{x} = z = \Pi_X[v_i(k)]$ in the preceding relation, we find

$$\begin{aligned}
\|x_i(k+1) - \Pi_X[v_i(k)]\|^2 \le{}& (1 + A\alpha_k^2)\, \mathrm{dist}^2(v_i(k), X) - \frac{3}{4} \|x_i(k+1) - v_i(k)\|^2 \\
& + \Big(\frac{3}{8c} + 2\alpha_k L\Big) \mathrm{dist}^2(v_i(k), X) + B\alpha_k^2 G_f^2.
\end{aligned} \tag{8}$$

By the definition of the projection, we have

$$\mathrm{dist}(x_i(k+1), X) = \|x_i(k+1) - \Pi_X[x_i(k+1)]\| \le \|x_i(k+1) - \Pi_X[v_i(k)]\|,$$

$$\|x_i(k+1) - v_i(k)\| \ge \big\|\Pi_{X_i^{\Omega_i(k)}}[v_i(k)] - v_i(k)\big\| = \mathrm{dist}\big(v_i(k), X_i^{\Omega_i(k)}\big).$$

Upon substituting these estimates in (8), we obtain

$$\begin{aligned}
\mathrm{dist}^2(x_i(k+1), X) \le{}& (1 + A\alpha_k^2)\, \mathrm{dist}^2(v_i(k), X) - \frac{3}{4}\, \mathrm{dist}^2\big(v_i(k), X_i^{\Omega_i(k)}\big) \\
& + \Big(\frac{3}{8c} + 2\alpha_k L\Big) \mathrm{dist}^2(v_i(k), X) + B\alpha_k^2 G_f^2.
\end{aligned} \tag{9}$$

Taking the expectation in (9) conditioned on $\mathcal{F}_k$, and using

$$E\big[\mathrm{dist}^2\big(v_i(k), X_i^{\Omega_i(k)}\big) \mid \mathcal{F}_k\big] \ge \frac{1}{c}\, \mathrm{dist}^2(v_i(k), X),$$

which follows by Assumption 3, we find that almost surely

$$E\big[\mathrm{dist}^2(x_i(k+1), X) \mid \mathcal{F}_k\big] \le (1 + A\alpha_k^2)\, \mathrm{dist}^2(v_i(k), X) - \Big(\frac{3}{8c} - 2\alpha_k L\Big) \mathrm{dist}^2(v_i(k), X) + B\alpha_k^2 G_f^2. \tag{10}$$

By using the definition of $v_i(k)$ (as a convex combination of $x_j(k)$ in (3)) and the convexity of the distance
function $x \mapsto \mathrm{dist}^2(x, X)$ (see [?, p. 88]), we find that

$$\mathrm{dist}^2(v_i(k), X) \le \sum_{j=1}^m [W(k)]_{ij}\, \mathrm{dist}^2(x_j(k), X).$$

The preceding relation and (10) imply that almost surely for all $k \ge 0$,

$$E\big[\mathrm{dist}^2(x_i(k+1), X) \mid \mathcal{F}_k\big] \le (1 + A\alpha_k^2) \sum_{j=1}^m [W(k)]_{ij}\, \mathrm{dist}^2(x_j(k), X) - \Big(\frac{3}{8c} - 2\alpha_k L\Big) \mathrm{dist}^2(v_i(k), X) + B\alpha_k^2 G_f^2.$$

Finally, by summing over all $i$ and using the fact that each $W(k)$ has column sums equal to 1, we arrive at the
following relation: almost surely for all $k \ge 0$,

$$E\Big[\sum_{i=1}^m \mathrm{dist}^2(x_i(k+1), X) \,\Big|\, \mathcal{F}_k\Big] \le (1 + A\alpha_k^2) \sum_{i=1}^m \mathrm{dist}^2(x_i(k), X) - \Big(\frac{3}{8c} - 2\alpha_k L\Big) \sum_{i=1}^m \mathrm{dist}^2(v_i(k), X) + mB\alpha_k^2 G_f^2.$$

Since $\sum_{k=0}^\infty \alpha_k^2 < \infty$, it follows that $\alpha_k \to 0$, implying that there exists $\bar{k}$ such that $\frac{3}{8c} - 2\alpha_k L > 0$ for all $k \ge \bar{k}$.
Therefore, for all $k \ge \bar{k}$, all the conditions of the supermartingale theorem are satisfied (Theorem 1). By applying
the supermartingale theorem (to a time-delayed process from $\bar{k}$ onward) we conclude that

$$\sum_{k=0}^\infty \mathrm{dist}^2(v_i(k), X) < \infty \quad \text{for all } i \in V \text{ a.s.}$$

Lemma 5 shows that the points $v_i(k)$ are getting close to the set $X$ relatively fast, as $k \to \infty$. If the set $X$
were compact, this would imply that all accumulation points of $\{v_i(k)\}$ would lie in the set $X$. However, there
would be no guarantee that the accumulation points of any two sequences $\{v_i(k)\}$ and $\{v_j(k)\}$ would be the same.
Even worse, Lemma 5 would give no information about optimality of any of the accumulation points. In the next
section, we provide a result that helps us claim later on that any two sequences $\{v_i(k)\}$ and $\{v_j(k)\}$ have the same
accumulation points.



C. Disagreement Estimate 

We now quantify the agent disagreements in time. We measure the disagreements by using the norm $\|v_i(k) - \bar{v}(k)\|$
of the differences between the estimates $v_i(k)$ generated by different agents according to the algorithm (2a)-(2b) and
their instantaneous average $\bar{v}(k) = \frac{1}{m} \sum_{\ell=1}^m v_\ell(k)$. The proof of our result relies on a lemma (adopted from [17,
Theorem 4.2]), which states that the iterates generated by a "perturbed" consensus protocol are guaranteed to arrive
at a consensus when the perturbations are small in some sense. This lemma is provided next.

Lemma 6: Let Assumptions 4 and 5 hold. Consider the iterates generated by

$$\theta_i(k+1) = \sum_{j=1}^m [W(k)]_{ij}\, \theta_j(k) + e_i(k) \quad \text{for all } i \in V. \tag{11}$$

Suppose there exists a non-negative non-increasing scalar sequence $\{\alpha_k\}$ such that $\sum_{k=0}^\infty \alpha_k \|e_i(k)\| < \infty$ for all
$i \in V$. Then, for all $i, j \in V$,

$$\sum_{k=0}^\infty \alpha_k \|\theta_i(k) - \theta_j(k)\| < \infty.$$



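Using Lemma 6, we prove the following disagreement results that will be important in our analysis later.

Lemma 7: Let Assumptions 1, 4 and 5 hold. Also, assume that the stepsize sequence $\{\alpha_k\}$ is non-increasing
and such that $\sum_{k=0}^\infty \alpha_k^2 < \infty$. Define

$$e_i(k) = x_i(k+1) - v_i(k) \quad \text{for all } i \in V \text{ and } k \ge 0.$$

Then, we have almost surely

$$\sum_{k=0}^\infty \|e_i(k)\|^2 < \infty \quad \text{for all } i \in V, \tag{12}$$

$$\sum_{k=0}^\infty \alpha_k \|v_i(k) - \bar{v}(k)\| < \infty \quad \text{for all } i \in V, \tag{13}$$

where $\bar{v}(k) = \frac{1}{m} \sum_{\ell=1}^m v_\ell(k)$.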



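Proof: Define $z_i(k) = \Pi_X[v_i(k)]$. Consider $\|e_i(k)\|$, for which we can write

$$\begin{aligned}
\|e_i(k)\| &\le \|x_i(k+1) - z_i(k)\| + \|z_i(k) - v_i(k)\| \\
&= \big\|\Pi_{X_i^{\Omega_i(k)}}\big[v_i(k) - \alpha_k \nabla f_i(v_i(k))\big] - z_i(k)\big\| + \|z_i(k) - v_i(k)\|.
\end{aligned}$$

Since $X \subseteq X_i^{\Omega_i(k)}$ and $z_i(k) \in X$, we have $z_i(k) \in X_i^{\Omega_i(k)}$. Using the projection theorem (Lemma 1), we obtain

$$\begin{aligned}
\|e_i(k)\| &\le \|v_i(k) - \alpha_k \nabla f_i(v_i(k)) - z_i(k)\| + \|z_i(k) - v_i(k)\| \\
&\le 2\|v_i(k) - z_i(k)\| + \alpha_k \|\nabla f_i(v_i(k))\| \\
&\le 2\|v_i(k) - z_i(k)\| + \alpha_k \|\nabla f_i(z_i(k))\| + \alpha_k \|\nabla f_i(v_i(k)) - \nabla f_i(z_i(k))\| \\
&\le (2 + \alpha_0 L) \|v_i(k) - z_i(k)\| + \alpha_k G_f,
\end{aligned} \tag{14}$$

where the last inequality follows by using $\alpha_k \le \alpha_0$, the Lipschitz gradient property of $f_i$ and the gradient
boundedness property (Assumptions 1(c) and 1(d)). Therefore, applying $(a + b)^2 \le 2a^2 + 2b^2$ in inequality (14),
we have for all $i \in V$ and $k \ge 0$,

$$\|e_i(k)\|^2 \le 2(2 + \alpha_0 L)^2 \|v_i(k) - z_i(k)\|^2 + 2\alpha_k^2 G_f^2. \tag{15}$$

Recall that we defined $z_i(k) = \Pi_X[v_i(k)]$, so we have $\|v_i(k) - z_i(k)\| = \mathrm{dist}(v_i(k), X)$. In the light of Lemma 5,
we also have $\sum_{k=0}^\infty \|v_i(k) - z_i(k)\|^2 < \infty$ almost surely. Since $\sum_{k=0}^\infty \alpha_k^2 < \infty$, we conclude from (15) that

$$\sum_{k=0}^\infty \|e_i(k)\|^2 < \infty \quad \text{for all } i \in V \text{ a.s.}$$

By applying the inequality $2ab \le a^2 + b^2$ to each term $\alpha_k \|e_i(k)\|$, we see that for all $i \in V$ almost surely

$$\sum_{k=0}^\infty \alpha_k \|e_i(k)\| \le \frac{1}{2} \sum_{k=0}^\infty \alpha_k^2 + \frac{1}{2} \sum_{k=0}^\infty \|e_i(k)\|^2 < \infty.$$

Now, we note that $x_i(k+1) = v_i(k) + e_i(k)$ with $v_i(k) = \sum_{j=1}^m [W(k)]_{ij}\, x_j(k)$ and the error $e_i(k)$ satisfying
$\sum_{k=0}^\infty \alpha_k \|e_i(k)\| < \infty$ almost surely. Therefore, by Lemma 6 it follows that

$$\sum_{k=0}^\infty \alpha_k \|x_i(k) - x_j(k)\| < \infty \quad \text{for all } i \text{ and } j \text{ a.s.} \tag{16}$$

Next, we consider $\|v_i(k) - \bar{v}(k)\|$. Recalling that $v_i(k) = \sum_{j=1}^m [W(k)]_{ij}\, x_j(k)$ (see (3)) and $W(k)$ is stochastic
(Assumption 5), and by using the convexity of the norm, we obtain

$$\|v_i(k) - \bar{v}(k)\| \le \sum_{j=1}^m [W(k)]_{ij} \|x_j(k) - \bar{v}(k)\| \le \sum_{j=1}^m \Big\| x_j(k) - \frac{1}{m} \sum_{\ell=1}^m x_\ell(k) \Big\|,$$

where in the last inequality we use $0 \le [W(k)]_{ij} \le 1$ and $\bar{v}(k) = \frac{1}{m} \sum_{\ell=1}^m x_\ell(k)$, which holds since $v_i(k) =
\sum_{j=1}^m [W(k)]_{ij}\, x_j(k)$ and each $W(k)$ is doubly stochastic. Therefore, by using the convexity of the norm again,
we see

$$\|v_i(k) - \bar{v}(k)\| \le \frac{1}{m} \sum_{j=1}^m \sum_{\ell=1}^m \|x_j(k) - x_\ell(k)\|.$$

We thus have

$$\alpha_k \|v_i(k) - \bar{v}(k)\| \le \frac{\alpha_k}{m} \sum_{j=1}^m \sum_{\ell=1}^m \|x_j(k) - x_\ell(k)\|,$$

and by using the relation in (16), we conclude that

$$\sum_{k=0}^\infty \alpha_k \|v_i(k) - \bar{v}(k)\| < \infty \quad \text{for all } i \in V \text{ a.s.}$$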



V. Almost Sure Convergence of DRP Algorithm 

We are now ready to assert the convergence of the method (2a)-(2b) using the lemmas established in Section IV. To outline the rough idea of the proof, let us note that Lemma 5 allows us to infer that $v_i(k)$ approaches the set $\mathcal{X}$. Lemma 7 will allow us to claim that any two sequences $\{v_i(k)\}$ and $\{v_j(k)\}$ have the same accumulation points almost surely, under some mild assumptions on the stepsize. To claim the convergence of the iterates to an optimal solution, it remains to relate the accumulation points of $\{v_i(k)\}$ to the optimal solutions of problem (1). This last piece is provided by the iterate relation of Lemma 4 supported by the supermartingale theorem.

From here onward, we use the following notation regarding the optimal value and optimal solutions of problem (1):

$$f^* = \min_{x \in \mathcal{X}} f(x), \qquad \mathcal{X}^* = \{x \in \mathcal{X} \mid f(x) = f^*\}.$$
We have the following convergence result. 

Proposition 1: Let Assumptions 1-5 hold. Let the stepsize be such that $\sum_{k=0}^{\infty} \alpha_k = \infty$ and $\sum_{k=0}^{\infty} \alpha_k^2 < \infty$. Assume that problem (1) has a nonempty optimal set $\mathcal{X}^*$. Then, the iterates $\{x_i(k)\}$, $i \in V$, generated by the method (2a)-(2b) converge almost surely to some random point in the optimal set $\mathcal{X}^*$, i.e., for some random vector $x^* \in \mathcal{X}^*$,

$$\lim_{k \to \infty} x_i(k) = x^* \quad \text{for all } i \in V \text{ a.s.}$$
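As a quick numerical illustration of the two stepsize conditions in Proposition 1, the sketch below tabulates partial sums for the illustrative choice $\alpha_k = 1/(k+1)$. The function name `partial_sums` is hypothetical. The first sum keeps growing (roughly like the logarithm of the number of terms), while the second stays below $\pi^2/6$.

```python
import math

def partial_sums(num_terms):
    """Return (sum of alpha_k, sum of alpha_k^2) for alpha_k = 1/(k+1), k = 0..num_terms-1."""
    sum_alpha = 0.0
    sum_alpha_sq = 0.0
    for k in range(num_terms):
        alpha = 1.0 / (k + 1)
        sum_alpha += alpha
        sum_alpha_sq += alpha * alpha
    return sum_alpha, sum_alpha_sq

for n in (10, 1000, 100000):
    s1, s2 = partial_sums(n)
    # s1 diverges slowly; s2 is bounded by pi^2 / 6
    print(n, round(s1, 3), round(s2, 6))
```

Any stepsize with the same two properties would serve equally well for the statement; this particular sequence is only an example.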

Proof: We use the definition of the iterate $x_i(k)$ in (2a)-(2b) and Lemma 4 with the following identification: $Y = \mathcal{X}_i^{\Omega_i(k)}$, $y = x_i(k+1)$, $x = v_i(k)$, $z = z_i(k) = \Pi_{\mathcal{X}}[v_i(k)]$, $\alpha = \alpha_k$, $\phi = f_i$, and $\tau = c$, where $c$ is the constant from Assumption 3. Thus, for any $\check{x} \in \mathcal{X}$, $k \ge 0$ and $i \in V$, we have

$$\begin{aligned}
\|x_i(k+1) - \check{x}\|^2 \le{}& (1 + A\alpha_k^2)\|v_i(k) - \check{x}\|^2 - 2\alpha_k \big(f_i(z_i(k)) - f_i(\check{x})\big) - \frac{3}{4}\|x_i(k+1) - v_i(k)\|^2 \\
& + \left(\frac{3}{8c} + 2\alpha_k L\right) \|v_i(k) - z_i(k)\|^2 + B\alpha_k^2 \|\nabla f_i(\check{x})\|^2,
\end{aligned}$$

with $A = 8L^2 + 16cL^2$ and $B = 8c + 8$. We next sum the preceding relations over $i = 1, \ldots, m$. Also, we use the convexity of the squared-norm (cf. (5)) and the double stochasticity of the weights to obtain the following relation:

$$\sum_{i=1}^{m} \|v_i(k) - \check{x}\|^2 \le \sum_{i=1}^{m} \sum_{j=1}^{m} [W(k)]_{ij} \|x_j(k) - \check{x}\|^2 = \sum_{j=1}^{m} \left( \sum_{i=1}^{m} [W(k)]_{ij} \right) \|x_j(k) - \check{x}\|^2 = \sum_{j=1}^{m} \|x_j(k) - \check{x}\|^2.$$



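By doing so, and taking into account that the gradients $\|\nabla f_i(x)\|$ are bounded over $\mathcal{X}$ by a scalar $G_f$ (Assumption 1(d)), we obtain for any $\check{x} \in \mathcal{X}$ and $k \ge 0$,

$$\begin{aligned}
\sum_{i=1}^{m} \|x_i(k+1) - \check{x}\|^2 \le{}& (1 + A\alpha_k^2) \sum_{i=1}^{m} \|x_i(k) - \check{x}\|^2 - 2\alpha_k \sum_{i=1}^{m} \big(f_i(z_i(k)) - f_i(\check{x})\big) - \frac{3}{4} \sum_{i=1}^{m} \|x_i(k+1) - v_i(k)\|^2 \\
& + \left(\frac{3}{8c} + 2\alpha_k L\right) \sum_{i=1}^{m} \|v_i(k) - z_i(k)\|^2 + mB\alpha_k^2 G_f^2.
\end{aligned} \tag{17}$$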

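Let $\bar{z}(k) = \frac{1}{m} \sum_{\ell=1}^{m} z_\ell(k)$ and recall that $f(x) = \sum_{i=1}^{m} f_i(x)$. Using $\bar{z}(k)$ and $f$, we can rewrite the second term on the right-hand side of (17) as follows:

$$\sum_{i=1}^{m} \big(f_i(z_i(k)) - f_i(\check{x})\big) = \sum_{i=1}^{m} \big(f_i(z_i(k)) - f_i(\bar{z}(k))\big) + \big(f(\bar{z}(k)) - f(\check{x})\big). \tag{18}$$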

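We estimate the first term on the right-hand side of the above equation as follows. Using the convexity of each function $f_i$, we obtain

$$\sum_{i=1}^{m} \big(f_i(z_i(k)) - f_i(\bar{z}(k))\big) \ge \sum_{i=1}^{m} \langle \nabla f_i(\bar{z}(k)),\, z_i(k) - \bar{z}(k) \rangle \ge -\sum_{i=1}^{m} \|\nabla f_i(\bar{z}(k))\|\, \|z_i(k) - \bar{z}(k)\|.$$

Since $\bar{z}(k)$ is a convex combination of points $z_i(k) \in \mathcal{X}$, it follows that $\bar{z}(k) \in \mathcal{X}$. This observation and Assumption 1(d), stating that the gradients $\nabla f_i(x)$ are uniformly bounded for $x \in \mathcal{X}$, yield

$$\sum_{i=1}^{m} \big(f_i(z_i(k)) - f_i(\bar{z}(k))\big) \ge -G_f \sum_{i=1}^{m} \|z_i(k) - \bar{z}(k)\|. \tag{19}$$

We next consider the term $\|z_i(k) - \bar{z}(k)\|$, for which by using $\bar{z}(k) = \frac{1}{m} \sum_{\ell=1}^{m} z_\ell(k)$ we have

$$\|z_i(k) - \bar{z}(k)\| = \left\| \frac{1}{m} \sum_{\ell=1}^{m} \big(z_i(k) - z_\ell(k)\big) \right\| \le \frac{1}{m} \sum_{\ell=1}^{m} \|z_i(k) - z_\ell(k)\| \le \frac{1}{m} \sum_{\ell=1}^{m} \|v_i(k) - v_\ell(k)\|,$$

where the first inequality is obtained by the convexity of the norm (see (5)) and the last inequality follows by the non-expansive projection property (Lemma 1). Furthermore, by using $\|v_i(k) - v_\ell(k)\| \le \|v_i(k) - \bar{v}(k)\| + \|v_\ell(k) - \bar{v}(k)\|$, we obtain for every $i \in V$,

$$\|z_i(k) - \bar{z}(k)\| \le \|v_i(k) - \bar{v}(k)\| + \frac{1}{m} \sum_{\ell=1}^{m} \|v_\ell(k) - \bar{v}(k)\|.$$

Upon summing over $i \in V$, we find that

$$\sum_{i=1}^{m} \|z_i(k) - \bar{z}(k)\| \le 2 \sum_{i=1}^{m} \|v_i(k) - \bar{v}(k)\|. \tag{20}$$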

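Combining relations (20) and (19), and substituting the resulting relation in equation (18), we find that

$$\sum_{i=1}^{m} \big(f_i(z_i(k)) - f_i(\check{x})\big) \ge -2G_f \sum_{i=1}^{m} \|v_i(k) - \bar{v}(k)\| + \big(f(\bar{z}(k)) - f(\check{x})\big).$$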

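Finally, by using the preceding estimate in inequality (17), we obtain for any $\check{x} \in \mathcal{X}$ and $k \ge 0$,

$$\begin{aligned}
\sum_{i=1}^{m} \|x_i(k+1) - \check{x}\|^2 \le{}& (1 + A\alpha_k^2) \sum_{i=1}^{m} \|x_i(k) - \check{x}\|^2 - 2\alpha_k \big(f(\bar{z}(k)) - f(\check{x})\big) - \frac{3}{4} \sum_{i=1}^{m} \|x_i(k+1) - v_i(k)\|^2 \\
& + \left(\frac{3}{8c} + 2\alpha_k L\right) \sum_{i=1}^{m} \|v_i(k) - z_i(k)\|^2 + 4\alpha_k G_f \sum_{i=1}^{m} \|v_i(k) - \bar{v}(k)\| + mB\alpha_k^2 G_f^2.
\end{aligned} \tag{21}$$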



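By the definition of $x_i(k+1)$, we have $x_i(k+1) \in \mathcal{X}_i^{\Omega_i(k)}$, which implies $\|x_i(k+1) - v_i(k)\| \ge \mathrm{dist}(v_i(k), \mathcal{X}_i^{\Omega_i(k)})$ for $i \in V$. Also, from the definition of $z_i(k) = \Pi_{\mathcal{X}}[v_i(k)]$, we have $\|v_i(k) - z_i(k)\| = \mathrm{dist}(v_i(k), \mathcal{X})$ for $i \in V$. Using these relations and letting $\check{x} = x^*$ for an arbitrary $x^* \in \mathcal{X}^*$, from (21) we obtain for all $k \ge 0$,

$$\begin{aligned}
\sum_{i=1}^{m} \|x_i(k+1) - x^*\|^2 \le{}& (1 + A\alpha_k^2) \sum_{i=1}^{m} \|x_i(k) - x^*\|^2 - 2\alpha_k \big(f(\bar{z}(k)) - f^*\big) - \frac{3}{4} \sum_{i=1}^{m} \mathrm{dist}^2\big(v_i(k), \mathcal{X}_i^{\Omega_i(k)}\big) \\
& + \left(\frac{3}{8c} + 2\alpha_k L\right) \sum_{i=1}^{m} \mathrm{dist}^2(v_i(k), \mathcal{X}) + 4\alpha_k G_f \sum_{i=1}^{m} \|v_i(k) - \bar{v}(k)\| + mB\alpha_k^2 G_f^2.
\end{aligned}$$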



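By taking the expectation conditioned on $\mathcal{F}_k$, and noting that $x_i(k)$, $v_i(k)$, $\bar{v}(k)$, and $\bar{z}(k)$ are fully determined by $\mathcal{F}_k$, we have almost surely for all $x^* \in \mathcal{X}^*$ and $k \ge 0$,

$$\begin{aligned}
\mathsf{E}\left[ \sum_{i=1}^{m} \|x_i(k+1) - x^*\|^2 \,\Big|\, \mathcal{F}_k \right] \le{}& (1 + A\alpha_k^2) \sum_{i=1}^{m} \|x_i(k) - x^*\|^2 - 2\alpha_k \big(f(\bar{z}(k)) - f^*\big) \\
& - \frac{3}{4}\, \mathsf{E}\left[ \sum_{i=1}^{m} \mathrm{dist}^2\big(v_i(k), \mathcal{X}_i^{\Omega_i(k)}\big) \,\Big|\, \mathcal{F}_k \right] + \left(\frac{3}{8c} + 2\alpha_k L\right) \sum_{i=1}^{m} \mathrm{dist}^2(v_i(k), \mathcal{X}) \\
& + 4\alpha_k G_f \sum_{i=1}^{m} \|v_i(k) - \bar{v}(k)\| + mB\alpha_k^2 G_f^2.
\end{aligned}$$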



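By Assumption 3 we have

$$\mathrm{dist}^2(x, \mathcal{X}) \le c\, \mathsf{E}\left[ \mathrm{dist}^2\big(x, \mathcal{X}_i^{\Omega_i(k)}\big) \,\Big|\, \mathcal{F}_k \right]$$

for all $x \in \mathbb{R}^d$ and all $i \in V$. Furthermore, since $\alpha_k \to 0$, by choosing $\tilde{k}$ large enough so that $2\alpha_k L \le \frac{3}{8c}$, we have for all $k \ge \tilde{k}$,

$$-\frac{3}{4}\, \mathsf{E}\left[ \sum_{i=1}^{m} \mathrm{dist}^2\big(v_i(k), \mathcal{X}_i^{\Omega_i(k)}\big) \,\Big|\, \mathcal{F}_k \right] + \left(\frac{3}{8c} + 2\alpha_k L\right) \sum_{i=1}^{m} \mathrm{dist}^2(v_i(k), \mathcal{X}) \le 0.$$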
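The following short derivation is a sketch of why a threshold of the form $2\alpha_k L \le \frac{3}{8c}$ suffices for the preceding bound. It uses only Assumption 3 with $x = v_i(k)$, and the threshold itself is reconstructed here from the coefficients $\frac{3}{4}$ and $\frac{3}{8c}$ appearing in the inequality.

```latex
\begin{align*}
&-\frac{3}{4}\,\mathsf{E}\!\left[\mathrm{dist}^2\big(v_i(k),\mathcal{X}_i^{\Omega_i(k)}\big) \,\middle|\, \mathcal{F}_k\right]
 + \left(\frac{3}{8c} + 2\alpha_k L\right)\mathrm{dist}^2(v_i(k),\mathcal{X}) \\
&\qquad\le \left(-\frac{3}{4c} + \frac{3}{8c} + 2\alpha_k L\right)\mathrm{dist}^2(v_i(k),\mathcal{X})
 = -\left(\frac{3}{8c} - 2\alpha_k L\right)\mathrm{dist}^2(v_i(k),\mathcal{X}) \le 0
 \quad\text{whenever } 2\alpha_k L \le \frac{3}{8c}.
\end{align*}
```

Summing over $i \in V$ gives the stated inequality.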



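Thus, we obtain almost surely for all $k \ge \tilde{k}$ and $x^* \in \mathcal{X}^*$,

$$\begin{aligned}
\mathsf{E}\left[ \sum_{i=1}^{m} \|x_i(k+1) - x^*\|^2 \,\Big|\, \mathcal{F}_k \right] \le{}& (1 + A\alpha_k^2) \sum_{i=1}^{m} \|x_i(k) - x^*\|^2 - 2\alpha_k \big(f(\bar{z}(k)) - f^*\big) \\
& + 4\alpha_k G_f \sum_{i=1}^{m} \|v_i(k) - \bar{v}(k)\| + mB\alpha_k^2 G_f^2.
\end{aligned} \tag{22}$$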



Since $\bar{z}(k) \in \mathcal{X}$, we have $f(\bar{z}(k)) - f^* \ge 0$. Thus, under the assumption $\sum_{k=0}^{\infty} \alpha_k^2 < \infty$ and Lemma 7, relation (22) satisfies all the conditions of the supermartingale convergence of Theorem 1. Hence, the sequence $\{\|x_i(k) - x^*\|^2\}$ is convergent almost surely for any $i \in V$ and $x^* \in \mathcal{X}^*$, and

$$\sum_{k=0}^{\infty} \alpha_k \big(f(\bar{z}(k)) - f(x^*)\big) < \infty \quad \text{a.s.}$$

The preceding relation and the condition $\sum_{k=0}^{\infty} \alpha_k = \infty$ imply that

$$\liminf_{k \to \infty} \big(f(\bar{z}(k)) - f(x^*)\big) = 0 \quad \text{a.s.} \tag{23}$$

By Lemma 5, noting that $z_i(k) = \Pi_{\mathcal{X}}[v_i(k)]$, we have $\sum_{k=0}^{\infty} \sum_{i=1}^{m} \|v_i(k) - z_i(k)\|^2 < \infty$ almost surely, implying

$$\lim_{k \to \infty} \|v_i(k) - z_i(k)\| = 0 \quad \text{for all } i \in V \text{ a.s.} \tag{24}$$

Recall that the sequence $\{\|x_i(k) - x^*\|\}$ is convergent almost surely for all $i \in V$ and every $x^* \in \mathcal{X}^*$. Then, in view of relation (2a), we have that the sequence $\{\|v_i(k) - x^*\|\}$ is also convergent almost surely for all $i \in V$ and $x^* \in \mathcal{X}^*$. By relation (24) it follows that $\{\|z_i(k) - x^*\|\}$ is also convergent almost surely for all $i \in V$ and $x^* \in \mathcal{X}^*$. Since $\|\bar{v}(k) - x^*\| \le \frac{1}{m} \sum_{i=1}^{m} \|v_i(k) - x^*\|$ and the sequence $\{\|v_i(k) - x^*\|\}$ is convergent almost surely for all $i \in V$ and $x^* \in \mathcal{X}^*$, it follows that $\{\|\bar{v}(k) - x^*\|\}$ is convergent almost surely for all $x^* \in \mathcal{X}^*$. Using a similar argument, we can conclude that $\{\|\bar{z}(k) - x^*\|\}$ is convergent almost surely for all $x^* \in \mathcal{X}^*$. As a particular consequence, it follows that the sequences $\{\bar{v}(k)\}$ and $\{\bar{z}(k)\}$ are almost surely bounded and, hence, they have accumulation points. From relation (23) and the continuity of $f$, it follows that the sequence $\{\bar{z}(k)\}$ must have one accumulation point in the set $\mathcal{X}^*$ almost surely. This and the fact that $\{\|\bar{z}(k) - x^*\|\}$ is convergent almost surely for every $x^* \in \mathcal{X}^*$ imply that for a random point $x^* \in \mathcal{X}^*$,

$$\lim_{k \to \infty} \bar{z}(k) = x^* \quad \text{a.s.} \tag{25}$$

Now, from $\bar{z}(k) = \frac{1}{m} \sum_{\ell=1}^{m} z_\ell(k)$ and $\bar{v}(k) = \frac{1}{m} \sum_{\ell=1}^{m} v_\ell(k)$, using relation (24) and the convexity of the norm (cf. (5)), we obtain almost surely

$$\lim_{k \to \infty} \|\bar{v}(k) - \bar{z}(k)\| \le \frac{1}{m} \sum_{\ell=1}^{m} \lim_{k \to \infty} \|v_\ell(k) - z_\ell(k)\| = 0.$$

In view of relation (25), it follows that

$$\lim_{k \to \infty} \bar{v}(k) = x^* \quad \text{a.s.} \tag{26}$$

By relation (13) in Lemma 7 we have

$$\liminf_{k \to \infty} \|v_i(k) - \bar{v}(k)\| = 0 \quad \text{for all } i \in V \text{ a.s.} \tag{27}$$

The fact that $\{\|v_i(k) - x^*\|\}$ is convergent almost surely for all $i$, together with (26) and (27), implies that

$$\lim_{k \to \infty} \|v_i(k) - x^*\| = 0 \quad \text{for all } i \in V \text{ a.s.} \tag{28}$$

Finally, from relation (12) in Lemma 7 we have $\lim_{k \to \infty} \|x_i(k+1) - v_i(k)\| = 0$ for all $i \in V$ almost surely, which together with the limit in (28) yields $\lim_{k \to \infty} x_i(k) = x^*$ for all $i \in V$ almost surely.

■ 

VI. Distributed Mini-Batch Random Projection Algorithm 



As an extension of the algorithm in (2a)-(2b), one may consider an algorithm where the agents use several random projections at each iteration. Namely, after generating $v_i(k)$ each agent may take (or nature may reveal) several random samples $\Omega_i^1(k), \ldots, \Omega_i^b(k)$, where each $\Omega_i^r(k) \in I_i$ and $b \ge 1$ is the batch size. Each collection $\Omega_i^1(k), \ldots, \Omega_i^b(k)$ consists of mutually independent random variables and is independent of the past realizations. More specifically, we have $b$ independent random samples of the iid random variable $\Omega_i(k)$ (taking values in $I_i$). Using the compact form of the update in (2a), in the mini-batch version of the algorithm each agent $i \in V$ performs the following steps:

$$v_i(k) = \sum_{j=1}^{m} [W(k)]_{ij}\, x_j(k), \tag{29a}$$

$$\psi_i^0(k) = v_i(k) - \alpha_k \nabla f_i(v_i(k)), \tag{29b}$$

$$\psi_i^r(k) = \Pi_{\mathcal{X}_i^{\Omega_i^r(k)}}\big[\psi_i^{r-1}(k)\big] \quad \text{for } r = 1, \ldots, b, \tag{29c}$$

$$x_i(k+1) = \psi_i^b(k), \tag{29d}$$

where $\alpha_k > 0$ is a stepsize at time $k$ and $x_i(0) \in \mathbb{R}^d$ is an initial estimate of agent $i$ (which can be random). The steps in (29b)-(29d) are the successive (random) projections on the sets $\mathcal{X}_i^{\Omega_i^1(k)}, \ldots, \mathcal{X}_i^{\Omega_i^b(k)}$ of the point $v_i(k) - \alpha_k \nabla f_i(v_i(k))$.
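The steps (29a)-(29d) can be sketched in a few lines of code. The sketch below is only an illustration under stated assumptions: every constraint component is a halfspace $\{w : \langle a, w\rangle \le \beta\}$ stored as a pair `(a, beta)`, components are drawn uniformly at random, and all function and variable names are hypothetical. Setting `b = 1` gives a single projection per iteration. The toy instance at the end is not meant to satisfy every assumption of the analysis (its gradients are unbounded on the feasible set); it only exercises the update.

```python
import random

def project_halfspace(x, a, beta):
    """Euclidean projection of x onto the halfspace {w : <a, w> <= beta}."""
    violation = sum(ai * xi for ai, xi in zip(a, x)) - beta
    if violation <= 0.0:
        return list(x)  # already feasible, nothing to do
    scale = violation / sum(ai * ai for ai in a)
    return [xi - scale * ai for xi, ai in zip(x, a)]

def minibatch_drp_step(X, W, grads, halfspaces, alpha, b, rng):
    """One iteration of (29a)-(29d) for all m agents.

    X[i] is agent i's estimate, W is an m-by-m doubly stochastic matrix,
    grads[i](v) returns the gradient of f_i at v, and halfspaces[i] lists the
    (a, beta) components of agent i's constraint set (assumed representation).
    """
    m, d = len(X), len(X[0])
    X_next = []
    for i in range(m):
        # (29a): weighted average of the agents' current estimates
        v = [sum(W[i][j] * X[j][t] for j in range(m)) for t in range(d)]
        # (29b): gradient step evaluated at v_i(k)
        g = grads[i](v)
        psi = [vt - alpha * gt for vt, gt in zip(v, g)]
        # (29c): b successive projections on randomly drawn components
        for _ in range(b):
            a, beta = rng.choice(halfspaces[i])
            psi = project_halfspace(psi, a, beta)
        X_next.append(psi)  # (29d)
    return X_next

# Toy instance (hypothetical): f_i(x) = 0.5 * ||x - c_i||^2 over {x_1 <= 1, x_2 <= 1}.
centers = [[2.0, 0.0], [0.0, 2.0], [2.0, 2.0]]
grads = [lambda v, c=c: [vt - ct for vt, ct in zip(v, c)] for c in centers]
components = [([1.0, 0.0], 1.0), ([0.0, 1.0], 1.0)]
halfspaces = [components] * 3
W = [[1.0 / 3.0] * 3 for _ in range(3)]  # clique with uniform weights

rng = random.Random(0)
X = [[0.0, 0.0] for _ in range(3)]
for k in range(5000):
    X = minibatch_drp_step(X, W, grads, halfspaces, 1.0 / (k + 1), 2, rng)
print(X)
```

For this toy problem the sum of the objectives equals $\frac{3}{2}\|x - (4/3, 4/3)\|^2$ plus a constant, so its constrained minimizer is $(1, 1)$, and the printed estimates of all three agents should lie close to that point.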

The algorithm using mini-batches for random projections is of interest when the set $I_i$ is large, i.e., the number of constraint set components $\mathcal{X}_i^j$, $j \in I_i$, of the set $\mathcal{X}_i = \cap_{j \in I_i} \mathcal{X}_i^j$ is large. In such cases, taking several projection steps is beneficial for reducing the infeasibility of the iterates $x_i(k)$ with respect to the set $\mathcal{X}_i$. More concretely, if each set $\mathcal{X}_i$ is the intersection of about $10^4$ simpler sets, then one sample of these sets will render a poor approximation of the true set $\mathcal{X}_i$, whereas 100 samples will provide a better approximation of the set. Let $\check{x}$ be a point in the feasible set $\mathcal{X}$. If just one sample is considered at each iteration, by the non-expansive projection property (Lemma 1), the distance between the next iterate and a point in $\mathcal{X}$ can be estimated as:

$$\|x_i(k+1) - \check{x}\| = \|\psi_i^1(k) - \check{x}\| \le \|\psi_i^0(k) - \check{x}\|,$$

whereas if 100 samples are considered for projections,

$$\|x_i(k+1) - \check{x}\| = \|\psi_i^{100}(k) - \check{x}\| \le \cdots \le \|\psi_i^1(k) - \check{x}\| \le \|\psi_i^0(k) - \check{x}\|,$$

which may yield a larger infeasibility reduction.
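The chain of inequalities above can be checked numerically. The following sketch assumes halfspace components that all contain the origin, so the origin plays the role of the feasible point, and records the distance to it after each of 100 successive projections. The names and the random instance are hypothetical.

```python
import math
import random

def project_halfspace(x, a, beta):
    """Euclidean projection of x onto the halfspace {w : <a, w> <= beta}."""
    violation = sum(ai * xi for ai, xi in zip(a, x)) - beta
    if violation <= 0.0:
        return list(x)
    scale = violation / sum(ai * ai for ai in a)
    return [xi - scale * ai for xi, ai in zip(x, a)]

def distances_along_projections(point, feasible_point, halfspaces):
    """Distance to feasible_point before and after each successive projection."""
    dists = [math.dist(point, feasible_point)]
    for a, beta in halfspaces:
        point = project_halfspace(point, a, beta)
        dists.append(math.dist(point, feasible_point))
    return dists

rng = random.Random(1)
origin = [0.0, 0.0, 0.0]
# beta >= 0 guarantees that the origin belongs to every halfspace
halfspaces = [([rng.uniform(-1.0, 1.0) for _ in range(3)], rng.uniform(0.0, 0.5))
              for _ in range(100)]
dists = distances_along_projections([5.0, -4.0, 3.0], origin, halfspaces)
print(dists[0], dists[1], dists[-1])
```

The recorded distances never increase, which is exactly the non-expansiveness used in the text. How much they decrease depends on the sampled components.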

For the algorithm using random mini-batch projections, we have the following convergence result. 

Proposition 2: Let Assumptions 1-5 hold, and let the stepsize satisfy $\sum_{k=0}^{\infty} \alpha_k = \infty$ and $\sum_{k=0}^{\infty} \alpha_k^2 < \infty$. Assume that problem (1) has a nonempty optimal set $\mathcal{X}^*$. Then, the iterates $\{x_i(k)\}$, $i \in V$, produced by the method (29a)-(29d) converge to some random point in the optimal set $\mathcal{X}^*$ almost surely, i.e., for some random vector $x^* \in \mathcal{X}^*$,

$$\lim_{k \to \infty} x_i(k) = x^* \quad \text{for all } i \in V \text{ a.s.}$$

Proof: The proof of this result is similar to that of Proposition 1. It requires some adjustments of Lemma 5 and Lemma 7. The proof with these adjustments is provided in Appendix A. ■

VII. Application - Distributed Support Vector Machines (DrSVM) 

In this section, we apply our DRP algorithm and its mini-batch variant to Support Vector Machines (SVMs). We provide a brief introduction to SVMs in Subsection VII-A, while in Subsection VII-B we report our numerical results on some data sets that are generously made available by Thorsten Joachims.



A. Support Vector Machines 

Support Vector Machines (SVMs) are popular classification tools with a strong theoretical background. Given a set of $n$ example-label pairs $\{(a_j, b_j)\}_{j=1}^{n}$, $a_j \in \mathbb{R}^d$ and $b_j \in \{+1, -1\}$, we need to find a vector $x = [y^T\ \xi^T]^T \in \mathbb{R}^{d+n}$ that solves the following optimization problem (a bias term is included in $y$ for convenience):

$$\begin{aligned}
\min\ & f(y, \xi) = \frac{1}{2}\|y\|^2 + C \sum_{j=1}^{n} \xi_j \\
\text{s.t. } & b_j \langle y, a_j \rangle \ge 1 - \xi_j, \quad \xi_j \ge 0, \quad \text{for all } j \in \{1, \ldots, n\}.
\end{aligned} \tag{30}$$

Here, we use slack variables $\xi_j$, for $j = 1, \ldots, n$, to consider linearly non-separable cases as well. If the optimal solution $(y^*, \xi^*)$ to this problem exists, the solution $y^*$ is the maximum-margin separating hyperplane [40]. For applying DRP to problem (30), we can define $f_i$ and $\mathcal{X}_i$ as follows:

$$f_i(x) = \frac{1}{2m}\|y\|^2 + C \sum_{j \in I_i} \xi_j,$$

$$\mathcal{X}_i = \{x \in \mathbb{R}^{d+n} \mid b_j \langle y, a_j \rangle \ge 1 - \xi_j,\ \xi_j \ge 0,\ \forall j \in I_i\},$$

where $I_i$ is a set of indices such that $\cup_{i=1}^{m} I_i = \{1, \ldots, n\}$, $I_i \cap I_j = \emptyset$ for $i \ne j$, and $j \in I_i$ if and only if $\mathcal{X}_i$ contains inequalities associated with the data $(a_j, b_j)$. Note that each set $\mathcal{X}_i^j = \{x \in \mathbb{R}^{d+n} \mid b_j \langle y, a_j \rangle \ge 1 - \xi_j,\ \xi_j \ge 0\}$ is the intersection of two halfspaces, the projection onto which can be computed in a few steps (see Appendix B).
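To make the decomposition concrete, here is a small sketch of the global objective in (30), one possible local objective, and a membership test for a single constraint component. The split $f_i = \frac{1}{2m}\|y\|^2 + C\sum_{j \in I_i} \xi_j$ is an assumption of this sketch, chosen because it sums to $f$ over the agents when the index sets partition $\{1, \ldots, n\}$. All function names are hypothetical and indices are 0-based.

```python
def svm_objective(y, xi, C):
    """Global objective of (30): 0.5 * ||y||^2 + C * sum_j xi_j."""
    return 0.5 * sum(t * t for t in y) + C * sum(xi)

def local_objective(y, xi, C, index_set, m):
    """Assumed local objective of one agent: its 1/m share of the regularizer
    plus the slack penalties of the samples it owns."""
    return 0.5 / m * sum(t * t for t in y) + C * sum(xi[j] for j in index_set)

def in_component(y, xi_j, a_j, b_j):
    """True if (y, xi_j) satisfies b_j * <y, a_j> >= 1 - xi_j and xi_j >= 0."""
    margin = b_j * sum(yt * at for yt, at in zip(y, a_j))
    return margin >= 1.0 - xi_j and xi_j >= 0.0

y = [1.0, -2.0]
xi = [0.5, 0.0, 1.5, 0.25]
index_sets = [[0, 1], [2, 3]]  # a partition of the sample indices among m = 2 agents
total = sum(local_objective(y, xi, 1.0, I, 2) for I in index_sets)
print(total, svm_objective(y, xi, 1.0))
```

Both printed values agree, which is the property $f = \sum_i f_i$ that the distributed formulation relies on.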



B. Simulations 

In this section, we perform some experiments with our DRP algorithm. We refer to our DRP algorithm applied to SVMs as DrSVM. The purpose of the experiments is to verify the convergence and to show in how many iterations the proposed method can actually arrive at consensus in distributed settings. We use the DRP algorithm in (2a)-(2b) and its variant in (29a)-(29d) with the stepsize $\alpha_k = \frac{1}{k+1}$ for $k \ge 0$. We vary the batch size $b$ as 1, 100 or 1000 to observe the different convergence speeds, where $b = 1$ corresponds to the algorithm in (2a)-(2b). To show the effect of connectivity, we compare two different time-invariant network topologies: i) a completely connected graph (clique) and ii) a 3-regular expander graph. The 3-regular expander graph is a sparse graph that has strong connectivity with every node having degree 3.

TABLE I
The statistics of three text classification data sets: n is the number of examples, d is the number of features, and s represents the sparsity of the data.

Data set    n          d         s
astro-ph    62,369     99,757    0.08%
CCAT        804,414    47,236    0.16%
C11         804,414    47,236    0.16%
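For readers who want to reproduce the two kinds of topology, the sketch below builds adjacency lists and a doubly stochastic weight matrix. Two points are assumptions of this sketch rather than details given in the text: the 3-regular graph is a simple circulant construction (3-regular, but not claimed to be an expander), and the weights are uniform over each node and its neighbors. All names are hypothetical.

```python
def clique(m):
    """Adjacency lists of the complete graph on m nodes."""
    return [[j for j in range(m) if j != i] for i in range(m)]

def circulant_3_regular(m):
    """Adjacency lists of a 3-regular graph: node i is linked to i-1, i+1 and i+m/2.

    Requires an even m >= 4. This is a stand-in 3-regular graph, not
    necessarily an expander.
    """
    if m < 4 or m % 2 != 0:
        raise ValueError("m must be even and at least 4")
    return [[(i - 1) % m, (i + 1) % m, (i + m // 2) % m] for i in range(m)]

def uniform_weights(neighbors):
    """W[i][j] = 1 / (deg(i) + 1) for j in N(i) or j == i.

    The result is row stochastic in general and doubly stochastic when the
    graph is regular and undirected.
    """
    m = len(neighbors)
    W = [[0.0] * m for _ in range(m)]
    for i, nbrs in enumerate(neighbors):
        share = 1.0 / (len(nbrs) + 1)
        W[i][i] = share
        for j in nbrs:
            W[i][j] = share
    return W

W_clique = uniform_weights(clique(6))
W_sparse = uniform_weights(circulant_3_regular(6))
print(W_sparse[0])
```

With these weights each agent averages over 6 estimates in the clique and over 4 in the sparse graph, which is one way to picture why sparser connectivity can slow down consensus.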

We use three text classification data sets for our experiments. The data sets were kindly provided by Thorsten Joachims (see Q for their descriptions). Table I lists the statistics of the data sets. All of the data sets are from binary document classification. Since the data sets used here are very unlikely to be separable, we use the formulation (30) with $C = 1$. In each experiment the number $n$ of constraints is divided among the agents equally (if $n$ is not divisible by $m$, the $m$-th agent gets the remainder). To estimate the generalization (or testing) performance, we split the data and use 80% for training and 20% for testing.
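The division of constraints among agents described above can be sketched as follows. The function name is hypothetical, indices are 0-based, and the assignment of consecutive index blocks is an assumption of the sketch; only the block sizes follow from the text.

```python
def partition_constraints(n, m):
    """Split indices 0..n-1 into m consecutive blocks of size n // m;
    the last (m-th) agent additionally receives the n % m leftover indices."""
    base = n // m
    blocks = [range(i * base, (i + 1) * base) for i in range(m - 1)]
    blocks.append(range((m - 1) * base, n))
    return blocks

blocks = partition_constraints(10, 3)
print([len(block) for block in blocks])
```

Each block plays the role of an index set $I_i$, so the blocks are disjoint and together cover all $n$ constraints.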

DrSVM is implemented in C/C++ and all experiments were performed on a 64-bit machine running Fedora 16 with an Intel Core 2 Quad Processor Q9400 and 8 GB of RAM. The experiments are not performed in a real networked environment, so we do not consider delays and node/link failures that may exist in networks.

For the stopping criterion, we first run a centralized random incremental projection [31] on the 80% training set with $b = 1$ until the relative error of the objective values in two consecutive iterations is less than 0.001, i.e.,

$$|f(x(k)) - f(x(k+1))| / f(x(k)) < 0.001.$$

We then measure the test accuracy of the final solution on the remaining 20% test set, which becomes the target test accuracy $t_{acc}$. For experiments in the distributed setting, we measure the test accuracy of every agent's solution at the end of every iteration. If every solution at a certain iteration satisfies the target value $t_{acc}$, we conclude that the agents arrived at a consensus and the algorithm converged. The maximum number of iterations in each simulation is limited to 20,000.
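The two stopping tests can be written down directly. In the sketch below the function names are hypothetical, and reading "satisfies the target value" as "test accuracy at least $t_{acc}$" is an assumption.

```python
def relative_change(f_prev, f_next):
    """Relative change |f(x(k)) - f(x(k+1))| / f(x(k)) of the objective."""
    return abs(f_prev - f_next) / f_prev

def centralized_run_converged(f_prev, f_next, tol=1e-3):
    """Stopping test for the centralized reference run."""
    return relative_change(f_prev, f_next) < tol

def agents_reached_target(test_accuracies, t_acc):
    """Distributed run stops once every agent's test accuracy meets the target."""
    return all(acc >= t_acc for acc in test_accuracies)

print(centralized_run_converged(2.0, 1.999))
print(agents_reached_target([0.95, 0.96, 0.951], 0.95))
```

The relative change divides by $f(x(k))$ exactly as in the displayed criterion, so it presumes a positive objective value, which holds for (30) away from the trivial point $y = 0$, $\xi = 0$.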

Table II shows the results. As we do more projections per iteration, the total number of iterations required for convergence decreases, regardless of the number of agents. For the given stopping criterion, it seems that fewer iterations are needed for DrSVM to converge as the number of agents increases. We can also observe the effect of network connectivity. When all the other parameters ($m$ and $b$) are the same, in most cases the number of iterations required for the 3-regular expander graph to converge is greater than or equal to that for the clique.

The table reports the number of iterations required for all the agents to achieve the target test accuracy. Therefore, the total number of projections is at most the number of iterations times $m$ times $b$. It is an upper bound because no projection is required if the current estimate is already in the selected constraint component. For example, the total number of projections for astro-ph with $m = 6$ and $b = 100$ is at most 4,800 (= 8 × 6 × 100).

The runtime (or the number of calculations) of the algorithm is proportional not only to the number of projections, but also to the number of gradient updates. For example, for astro-ph with $m = 6$ and $b = 1$, the total number of projections is 4,170 (= 695 × 6 × 1), while the total number of gradient updates is 4,170 (= 695 × 6). For the same example with $m = 6$ and $b = 100$, the total number of projections is 4,800 (= 8 × 6 × 100), but the total number of gradient updates is only 48 (= 8 × 6). In any case, the numbers are much smaller than the number 62,369 of the training data points. This shows that DrSVM can quickly find a good quality solution before examining the training samples even once.
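The counts quoted above follow from two products, which the short sketch below reproduces for the astro-ph runs with $m = 6$. The function name and dictionary keys are hypothetical.

```python
def work_counts(iterations, m, b):
    """Upper bound on projections and exact number of gradient updates
    for a run with the given iteration count, m agents and batch size b."""
    return {
        "projections_upper_bound": iterations * m * b,
        "gradient_updates": iterations * m,
    }

print(work_counts(695, 6, 1))
print(work_counts(8, 6, 100))
```

Raising the batch size from 1 to 100 leaves the projection count at the same order of magnitude while cutting the gradient updates from 4,170 to 48.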

To show the convergence (and consensus) of the algorithm, we plot in Figure 1 the objective value $f(x)$ of centralized random projection (CRP) and DRP with 10 agents for the example astro-ph. Note that we plot the convergence of the objective value instead of the solution. This is because CRP and DRP may converge to different optimal points, as the problem (30) may not have a unique optimal solution. For Figure 1(a) and 1(b), we applied the random projection once and 100 times per iteration, respectively. From the figures, we can observe that the objective values of CRP and the 10 agents in DRP are almost identical. The final objective of Figure 1(b) seems smaller than that of 1(a). This is because the stepsize at iteration 1000 is too small.

Fig. 1. f(x) vs. iteration on astro-ph with 10 agents when the batch size b is 1 and 100. Panels: (a) b = 1, (b) b = 100. Legend: Centralized; Distributed, m = 10.

TABLE II
The results of DrSVM with two different graph topologies (clique and 3-regular expander graph) and three different numbers of agents (m = 2, 6, 10): t_acc is the target test accuracy and b is the number of projections per iteration. The table shows the number of iterations for all agents to reach the target test accuracy, where '-' indicates that the algorithm did not converge within the 20,000 maximum iteration limit.

                                       Clique             3-regular expander
Data set   t_acc   b       m = 2    m = 6    m = 10     m = 6    m = 10
astro-ph   0.95    1       1,055    695      697        695      -
                   100     11       8        11         11       11
                   1000    2        2        2          2        2
CCAT       0.91    1       752      511      362        517      -
                   100     11       10       8          10       8
                   1000    2        3        2          3        3
C11        0.97    1       1,511    1,255    799        1,226    -
                   100     16       17       12         17       15
                   1000    2        2        2          2        2



VIII. Conclusions 

We have proposed and analyzed a distributed gradient algorithm with random incremental projections for a network of agents with time-varying connectivity. We considered the most general case, where each agent has its own objective and constraint set, different from those of the other agents. The proposed algorithm is applicable to problems where the whole constraint set is not known in advance but its components are revealed over time, or where the projection onto the whole set is computationally prohibitive. We have established almost sure convergence of the algorithm when the objective is convex under typical assumptions. Also, we have provided a variant of the algorithm using a mini-batch of consecutive projections and established its convergence in the almost sure sense. Experiments on three text classification benchmarks using SVMs were performed to verify the performance of the proposed algorithm.

Future work includes some extensions of the distributed model proposed here. First, we have assumed that the evaluated gradients have no errors. We can consider the effects of stochastic gradient errors in a future analysis. Second, more robust algorithms can be developed to also handle asynchronous networks with communication delays, noise and/or failures in links/nodes. Third, an implementation in a real parallel computing environment will be needed to handle large-scale data sets.

Appendix 

A. Proof of Proposition 2

We construct the proof by adjusting the result of Lemma 4 and by verifying that Lemma 5 and Lemma 7 apply to the mini-batch variant of the DRP method. The basic insight that guides the proof is that the operation of successive projections on components $\mathcal{X}_i^j$ of the set $\mathcal{X}_i = \cap_{j \in I_i} \mathcal{X}_i^j$ remains a non-expansive operation with respect to points that belong to the set $\mathcal{X}_i$, as well as with respect to the points in the intersection set $\mathcal{X} = \cap_{i=1}^{m} \mathcal{X}_i$.

1) Basic Iterate Relation for Mini-Batch Algorithm: For the iterates generated by the mini-batch random projection algorithm in (29a)-(29d), we have the following basic result.

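Lemma 8: Let Assumption 1 hold. Then, for any $\check{x} \in \mathcal{X}$ and $z \in \mathbb{R}^d$, and for all $i \in V$ and all $k \ge 0$,

$$\begin{aligned}
\|x_i(k+1) - \check{x}\|^2 \le{}& (1 + A_\tau \alpha_k^2)\|v_i(k) - \check{x}\|^2 - 2\alpha_k \big(f_i(z) - f_i(\check{x})\big) - \frac{3}{4}\|\psi_i^1(k) - v_i(k)\|^2 \\
& + \left(\frac{3}{8\tau} + 2\alpha_k L\right) \|v_i(k) - z\|^2 + B_\tau \alpha_k^2 G_f^2,
\end{aligned}$$

where $A_\tau = 8L^2 + 16\tau L^2$, $B_\tau = 8\tau + 8$ and $\tau > 0$ is arbitrary.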

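Proof: By using the non-expansiveness property of the projection operation (Lemma 1(a)), we have for arbitrary $\check{x} \in \mathcal{X}$ (since $\mathcal{X} \subseteq \mathcal{X}_i^j$ for all $j \in I_i$), and for all $i \in V$ and $k \ge 0$,

$$\|x_i(k+1) - \check{x}\| \le \|\psi_i^{b-1}(k) - \check{x}\| \le \cdots \le \|\psi_i^1(k) - \check{x}\|. \tag{31}$$

The intermediate iterate $\psi_i^1(k)$ is obtained after just one projection step,

$$\psi_i^1(k) = \Pi_{\mathcal{X}_i^{\Omega_i^1(k)}}\big[v_i(k) - \alpha_k \nabla f_i(v_i(k))\big],$$

so it satisfies Lemma 4 with $y = \psi_i^1(k)$, $Y = \mathcal{X}_i^{\Omega_i^1(k)}$, $x = v_i(k)$, $\alpha = \alpha_k$, and $\phi = f_i$. Thus, we have for any $\check{x} \in \mathcal{X}$ and $z \in \mathbb{R}^d$,

$$\begin{aligned}
\|\psi_i^1(k) - \check{x}\|^2 \le{}& (1 + A_\tau \alpha_k^2)\|v_i(k) - \check{x}\|^2 - 2\alpha_k \big(f_i(z) - f_i(\check{x})\big) - \frac{3}{4}\|\psi_i^1(k) - v_i(k)\|^2 \\
& + \left(\frac{3}{8\tau} + 2\alpha_k L\right) \|v_i(k) - z\|^2 + B_\tau \alpha_k^2 \|\nabla f_i(\check{x})\|^2.
\end{aligned} \tag{32}$$

From (31) and (32), by using the gradient boundedness property of Assumption 1(d), we obtain the stated relation.

■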

2) Conditional Expectation Relation for Mini-Batch Algorithm: The convergence proof of Proposition 2 requires a relation for the iterates involving expectations with respect to the past history of the method. For this, we need to define a relevant $\sigma$-algebra. We let $\mathcal{F}_k$ be the $\sigma$-algebra generated by the entire history of the algorithm up to time $k - 1$ inclusively. Thus, $\mathcal{F}_k$ includes the realizations of all the random variables but not the realizations of the indices $\Omega_i^1(k), \ldots, \Omega_i^b(k)$ at time $k$. Specifically, for all $k \ge 1$ it is given by

$$\mathcal{F}_k = \{x_i(0),\ i \in V\} \cup \{\Omega_i^r(\ell);\ 0 \le \ell \le k-1,\ 1 \le r \le b,\ i \in V\},$$

where $\mathcal{F}_0 = \{x_i(0),\ i \in V\}$.

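Now, with this definition of the $\sigma$-algebra, we have the following result.

Lemma 9: Let Assumptions 1 and 3 hold. Then, almost surely for any $\check{x} \in \mathcal{X}$, and for all $i \in V$ and all $k \ge 0$,

$$\begin{aligned}
\mathsf{E}\left[ \|x_i(k+1) - \check{x}\|^2 \mid \mathcal{F}_k \right] \le{}& (1 + A\alpha_k^2)\|v_i(k) - \check{x}\|^2 - 2\alpha_k \big(f_i(z_i(k)) - f_i(\check{x})\big) \\
& - \left(\frac{3}{8c} - 2\alpha_k L\right) \mathrm{dist}^2(v_i(k), \mathcal{X}) + B\alpha_k^2 G_f^2,
\end{aligned}$$

where $z_i(k) = \Pi_{\mathcal{X}}[v_i(k)]$, $A = 8L^2 + 16cL^2$, $B = 8c + 8$, and $c$ is from Assumption 3.

Proof: By letting $z = z_i(k)$ and $\tau = c$ in Lemma 8 and taking the expectation conditioned on $\mathcal{F}_k$, we obtain

$$\begin{aligned}
\mathsf{E}\left[ \|x_i(k+1) - \check{x}\|^2 \mid \mathcal{F}_k \right] \le{}& (1 + A\alpha_k^2)\|v_i(k) - \check{x}\|^2 - 2\alpha_k \big(f_i(z_i(k)) - f_i(\check{x})\big) - \frac{3}{4}\, \mathsf{E}\left[ \|\psi_i^1(k) - v_i(k)\|^2 \mid \mathcal{F}_k \right] \\
& + \left(\frac{3}{8c} + 2\alpha_k L\right) \|v_i(k) - z_i(k)\|^2 + B\alpha_k^2 G_f^2,
\end{aligned}$$

where $A = 8L^2 + 16cL^2$ and $B = 8c + 8$.

Since $\psi_i^1(k) \in \mathcal{X}_i^{\Omega_i^1(k)}$, by the projection property we have $\|\psi_i^1(k) - v_i(k)\|^2 \ge \|\Pi_{\mathcal{X}_i^{\Omega_i^1(k)}}[v_i(k)] - v_i(k)\|^2$. Then,

$$\begin{aligned}
\mathsf{E}\left[ \|\psi_i^1(k) - v_i(k)\|^2 \mid \mathcal{F}_k \right] &\ge \mathsf{E}\left[ \big\|\Pi_{\mathcal{X}_i^{\Omega_i^1(k)}}[v_i(k)] - v_i(k)\big\|^2 \,\Big|\, \mathcal{F}_k \right] \\
&= \mathsf{E}\left[ \mathrm{dist}^2\big(v_i(k), \mathcal{X}_i^{\Omega_i^1(k)}\big) \,\Big|\, \mathcal{F}_k \right].
\end{aligned}$$

Furthermore, by Assumption 3 we have

$$\mathsf{E}\left[ \big\|\Pi_{\mathcal{X}_i^{\Omega_i^1(k)}}[v_i(k)] - v_i(k)\big\|^2 \,\Big|\, \mathcal{F}_k \right] \ge \frac{1}{c}\, \mathrm{dist}^2(v_i(k), \mathcal{X}).$$

The preceding relations and $\mathrm{dist}(v_i(k), \mathcal{X}) = \|v_i(k) - z_i(k)\|$ yield the desired relation. ■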
3) Lemma 5 and Lemma 7 hold: Using Lemma 9 we argue that the results of Lemma 5 and Lemma 7 apply to the mini-batch random projection algorithm.

Claim 1: Lemma 5 holds for the iterates generated by method (29a)-(29d).

Proof: By letting $\check{x} = \Pi_{\mathcal{X}}[v_i(k)]$ in Lemma 9 and noting that $\|x_i(k+1) - \Pi_{\mathcal{X}}[v_i(k)]\| \ge \mathrm{dist}(x_i(k+1), \mathcal{X})$ and $\|v_i(k) - \Pi_{\mathcal{X}}[v_i(k)]\| = \mathrm{dist}(v_i(k), \mathcal{X})$, we obtain

$$\mathsf{E}\left[ \mathrm{dist}^2(x_i(k+1), \mathcal{X}) \mid \mathcal{F}_k \right] \le (1 + A\alpha_k^2)\, \mathrm{dist}^2(v_i(k), \mathcal{X}) - \left(\frac{3}{8c} - 2\alpha_k L\right) \mathrm{dist}^2(v_i(k), \mathcal{X}) + B\alpha_k^2 G_f^2,$$

which is the same as relation (10) within the proof of Lemma 5. The rest of the proof of Lemma 5 holds exactly as given, and the result of Lemma 5 remains valid. ■
Claim 2: Lemma 7 holds for the iterates generated by method (29a)-(29d).

Proof: Define $e_i(k) = x_i(k+1) - v_i(k)$ and $z_i(k) = \Pi_{\mathcal{X}}[v_i(k)]$. Now, consider $\|e_i(k)\|$, for which we have

$$\|e_i(k)\| \le \|x_i(k+1) - z_i(k)\| + \|z_i(k) - v_i(k)\|.$$

The non-expansive projection property and the fact that $z_i(k) \in \mathcal{X} \subseteq \mathcal{X}_i^{\Omega_i^r(k)}$ for all $r = 1, \ldots, b$ and any realization of these sets imply

$$\|x_i(k+1) - z_i(k)\| \le \|\psi_i^{b-1}(k) - z_i(k)\| \le \cdots \le \|\psi_i^1(k) - z_i(k)\| \le \|v_i(k) - \alpha_k \nabla f_i(v_i(k)) - z_i(k)\|.$$

Therefore

$$\|e_i(k)\| \le \|v_i(k) - \alpha_k \nabla f_i(v_i(k)) - z_i(k)\| + \|z_i(k) - v_i(k)\|,$$

which is the same as the first inequality in (14) within the proof of Lemma 7. The rest of the proof of that lemma holds verbatim, and the result follows. ■
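4) Details of the Proof of Proposition 2: We now connect the preceding results and provide the proof of Proposition 2. Starting from the relation in Lemma 9, after summing over all $i \in V$, we can see that almost surely for all $\check{x} \in \mathcal{X}$ and all $k \ge 0$,

$$\begin{aligned}
\mathsf{E}\left[ \sum_{i=1}^{m} \|x_i(k+1) - \check{x}\|^2 \,\Big|\, \mathcal{F}_k \right] \le{}& (1 + A\alpha_k^2) \sum_{i=1}^{m} \|v_i(k) - \check{x}\|^2 - 2\alpha_k \sum_{i=1}^{m} \big(f_i(z_i(k)) - f_i(\check{x})\big) \\
& - \left(\frac{3}{8c} - 2\alpha_k L\right) \sum_{i=1}^{m} \mathrm{dist}^2(v_i(k), \mathcal{X}) + mB\alpha_k^2 G_f^2,
\end{aligned}$$

where $z_i(k) = \Pi_{\mathcal{X}}[v_i(k)]$.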

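Now, the same as in the proof of Proposition 1, using the properties of the matrices $W(k)$ and the convexity of the squared-norm function (see (5)), we can show that

$$\sum_{i=1}^{m} \|v_i(k) - \check{x}\|^2 \le \sum_{i=1}^{m} \|x_i(k) - \check{x}\|^2.$$

Also, using verbatim arguments, we can show that

$$\sum_{i=1}^{m} \big(f_i(z_i(k)) - f_i(\check{x})\big) \ge -2G_f \sum_{i=1}^{m} \|v_i(k) - \bar{v}(k)\| + \big(f(\bar{z}(k)) - f(\check{x})\big),$$

where $\bar{z}(k) = \frac{1}{m} \sum_{\ell=1}^{m} z_\ell(k)$ and $\bar{v}(k) = \frac{1}{m} \sum_{\ell=1}^{m} v_\ell(k)$. Under the conditions of Proposition 2, we have $\alpha_k \to 0$. Choosing $\tilde{k}$ large enough so that $2\alpha_k L \le \frac{3}{8c}$ for all $k \ge \tilde{k}$, we have

$$-\left(\frac{3}{8c} - 2\alpha_k L\right) \sum_{i=1}^{m} \mathrm{dist}^2(v_i(k), \mathcal{X}) \le 0.$$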



By combining all the preceding relations, we obtain almost surely for all $\check{x} \in \mathcal{X}$ and all $k \ge \tilde{k}$,

$$\begin{aligned}
\mathsf{E}\left[ \sum_{i=1}^{m} \|x_i(k+1) - \check{x}\|^2 \,\Big|\, \mathcal{F}_k \right] \le{}& (1 + A\alpha_k^2) \sum_{i=1}^{m} \|x_i(k) - \check{x}\|^2 - 2\alpha_k \big(f(\bar{z}(k)) - f(\check{x})\big) \\
& + 4\alpha_k G_f \sum_{i=1}^{m} \|v_i(k) - \bar{v}(k)\| + mB\alpha_k^2 G_f^2.
\end{aligned}$$

Letting $\check{x} = x^*$ for an arbitrary optimal solution $x^* \in \mathcal{X}^*$, from the preceding relation we arrive at relation (22) in the proof of Proposition 1. From relation (22) onward, the proof of Proposition 1 holds verbatim, and the stated almost sure convergence of the mini-batch method follows.

B. Projection onto the Intersection of Two Half-spaces 

Given $v \in \mathbb{R}^d$, we are interested in solving the following optimization problem:

$$\begin{aligned}
\min_{w \in \mathbb{R}^d}\ & \frac{1}{2}\|w - v\|^2 \\
\text{s.t. } & \langle a, w \rangle \le b, \quad w_i \ge 0,
\end{aligned} \tag{33}$$

where $a \in \mathbb{R}^d$, $b \in \mathbb{R}$ and $w_i$ is the $i$-th component of the vector $w$.

The two half-spaces divide the space $\mathbb{R}^d$ into four parts. Therefore, there are only four cases to consider.

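1) $\langle a, v \rangle \le b$ and $v_i \ge 0$.

In this case, $v$ is already in the intersection and $w = v$.

2) $\langle a, v \rangle > b$ and $v_i < 0$.

In this case, $v$ is projected onto the intersection of the two hyperplanes $\{w \mid \langle a, w \rangle = b\}$ and $\{w \mid w_i = 0\}$. Finding such a projection is equivalent to solving the following optimization problem:

$$\begin{aligned}
\min_{w \in \mathbb{R}^d}\ & \frac{1}{2}\|w - v\|^2 \\
\text{s.t. } & \langle a, w \rangle = b, \quad w_i = 0.
\end{aligned} \tag{34}$$

The Lagrangian of the problem (34) is

$$\mathcal{L}(w, \theta, \zeta) = \frac{1}{2}\|w - v\|^2 + \theta \left( \sum_{j=1}^{d} a_j w_j - b \right) + \zeta w_i,$$

where $\theta, \zeta \in \mathbb{R}$ are Lagrange multipliers. Differentiating the Lagrangian and setting it to zero gives the optimality condition,

$$w_i^* - v_i + a_i \theta^* + \zeta^* = 0,$$

$$w_j^* - v_j + a_j \theta^* = 0, \quad \text{for } j \ne i.$$

From the primal feasibility, we have the following relations:

$$w_i^* = 0, \qquad \zeta^* = v_i - a_i \theta^*,$$

$$\sum_{j \ne i} a_j w_j^* = \sum_{j \ne i} a_j v_j - \theta^* \sum_{j \ne i} a_j^2 = b, \qquad \text{i.e.,} \quad \theta^* = \frac{\sum_{j \ne i} a_j v_j - b}{\sum_{j \ne i} a_j^2}.$$

Therefore, the projection is given by

$$w_j^* = \begin{cases} 0 & \text{if } j = i, \\ v_j - a_j \theta^* & \text{otherwise.} \end{cases}$$

Let $w^* = [w_1^*, \ldots, w_d^*]^T$.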

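3) $\langle a, v \rangle > b$ and $v_i \ge 0$.

In this case, $v$ will be projected either onto the hyperplane $\{w \mid \langle a, w \rangle = b\}$ or onto the intersection of the two hyperplanes $\{w \mid \langle a, w \rangle = b\}$ and $\{w \mid w_i = 0\}$. Let $\tilde{w}$ be the projection of $v$ onto $\{w \mid \langle a, w \rangle = b\}$, i.e.,

$$\tilde{w} = v - \frac{\langle a, v \rangle - b}{\|a\|^2}\, a.$$

The projection of $v$ in this case is given by

$$w = \begin{cases} \tilde{w} & \text{if } \tilde{w}_i \ge 0, \\ w^* & \text{otherwise.} \end{cases}$$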


4) $\langle a, v \rangle \le b$ and $v_i < 0$.

Let $\tilde{w}$ be the projection of $v$ onto the hyperplane $\{w \mid w_i = 0\}$, i.e.,

$$\tilde{w} = v - v_i e_i,$$

where $e_i \in \mathbb{R}^d$ is the vector whose $i$-th component is one and all the other components are zero. Then, the projection of $v$ is given by

$$w = \begin{cases} \tilde{w} & \text{if } \langle a, \tilde{w} \rangle \le b, \\ w^* & \text{otherwise.} \end{cases}$$
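The projection in problem (33) can be sketched in code as follows. Instead of branching on the four sign patterns, this sketch accepts the projection onto a single half-space whenever that point also satisfies the other constraint, and otherwise falls back to the closed-form solution $w^*$ of (34). That ordering is a design choice of the sketch. The function names are hypothetical and the index `i` is 0-based.

```python
def _dot(p, q):
    return sum(pj * qj for pj, qj in zip(p, q))

def project_two_halfspaces(v, a, b, i):
    """Project v onto {w : <a, w> <= b, w_i >= 0}."""
    if _dot(a, v) <= b and v[i] >= 0.0:
        return list(v)  # v is already in the intersection
    # Candidate 1: projection onto {w : <a, w> <= b}; accept it if w_i >= 0.
    excess = _dot(a, v) - b
    if excess > 0.0:
        step = excess / _dot(a, a)
        w = [vj - step * aj for vj, aj in zip(v, a)]
        if w[i] >= 0.0:
            return w
    # Candidate 2: projection onto {w : w_i >= 0}; accept it if <a, w> <= b.
    if v[i] < 0.0:
        w = list(v)
        w[i] = 0.0
        if _dot(a, w) <= b:
            return w
    # Otherwise both constraints are active: closed-form solution of (34).
    denom = sum(aj * aj for j, aj in enumerate(a) if j != i)
    if denom == 0.0:
        raise ValueError("a has no nonzero entry outside index i")
    theta = (sum(a[j] * v[j] for j in range(len(v)) if j != i) - b) / denom
    w = [vj - theta * aj for vj, aj in zip(v, a)]
    w[i] = 0.0
    return w

print(project_two_halfspaces([2.0, 3.0], [1.0, 1.0], 1.0, 0))
print(project_two_halfspaces([-1.0, 0.5], [-1.0, 1.0], 0.0, 0))
```

In the SVM setting each component $\mathcal{X}_i^j$ has this form after moving all terms of $b_j \langle y, a_j \rangle \ge 1 - \xi_j$ to one side, with the slack $\xi_j$ playing the role of the sign-constrained coordinate.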